THE CONCEPT OF VALIDITY IN MENTAL AND
ACHIEVEMENT TESTING 


AUSTIN H. TURNEY 
University of Kansas 


In the field of mental testing and in achievement testing also, the 
term validity has received varying definitions, and numerous methods 
have been used to determine validity. Furthermore there has been 
no very marked tendency to distinguish validity from other properties 
of a test such as discrimination and reliability. It is common knowledge
that validity has been defined as “that property of a test by
virtue of which it measures what it purports to measure,” and also
as “the general worthwhileness of a test.”

The methods used to determine validity are frequently quite
different. According to certain writers one may use, as evidence
of validity, the size of the standard deviation, the coefficient of
correlation with other tests supposedly valid, or the fact that the items
in the test have supposedly been selected from the field to be measured,
among others. Some of these methods can be shown to have no genuine
value as methods of validation.


This paper seeks to justify a single definition of validity and a
single criterion for judging validity, and to consider some of the
questions that would arise from such limitations. The application
of an exclusive definition and an exclusive criterion to mental testing
involves points common to both fields, and these will be considered
first, but in the field of achievement testing some special problems
arise and these will be considered at length in the latter part of the
paper.


I. APPLICATION TO MENTAL TESTING 


Selection of items from the field to be measured is not a new
concept in mental measurement. It has been employed since the
testing movement began. If the mental tester did not always state
his purpose as such, at least in practice this often occurred.
Thorndike expressed it when he wrote: “The present status of such
instruments . . . is roughly as follows: We have chosen tests where the
judgment of sensible people in general is that correct response or
speed of correct response is characteristic of intellect.”¹ In such a
statement the “field” of intellect is recognized and the criterion of
validity is selection of items from the field to be measured, the
technique of selection being the “judgment of sensible people.”
Yet it is obvious to anyone who has reviewed the definitions set
forth in such symposia as that in the Journal of Educational Psychology
that the field of intellect is never quite the same to any two authorities,
and hence the selection of items from the field yields widely different
tests.²

The early workers especially included in their tests tasks calling 
into play different abilities or functions, each tester depending upon 
his own interpretation as to what was intelligent behavior. While 
there was a rather general tendency to accept the definition that 
validity was “that characteristic of a test by virtue of which the test
measures what it purports to measure,” there was no complete agreement
as to what it purported to measure, or if there was a seeming
agreement in definition, there was no agreement as to the delimitation
of the field of “intellect.”

It is not surprising, therefore, that since the test makers failed to
agree upon the nature of intelligence or any definition of it, and since
no adequate attempt has been made to delimit the field to be tested,
various definitions of validity have become accepted, and methods of
validation have come into use other than that of selecting items from
the field to be tested.³

When in the case of mental tests we define validity as that property
of the test by virtue of which it measures what it purports to measure,
the definition of validity appears to be closely linked with the definition
of intelligence. Hence, according to such a definition it would seem
impossible to make a valid test since “what the test purports to
measure” cannot be defined. It would then follow that the field
could not be delimited, and hence the criterion of validity (selection of
items from the field to be measured) could not be applied exclusively.

1 Thorndike, E. L.: “The Measurement of Intelligence.” 1925.

2 Intelligence and Its Measurement: A Symposium. Journal of Educational
Psychology, March, 1921.

3 Sangren, Paul V.: Comparative Validity of Primary Intelligence Tests.
Journal of Applied Psychology, Vol. XIII, 1929, pp. 394-412.

Yet this difficulty is not insurmountable. We need not assume
that the test measures “intelligence.” In Spearman’s theory may
be found not only an escape from the differences of opinion regarding
“intelligence,” but also a technique for selecting test items from a
delimited field.¹ According to this theory the tests purport to measure
“g,” and the items of a mental test are selected from the field of
measurable behavior known to contain “g.”²

Spearman’s theory represents the most notable and thoroughgoing
attempt yet made to delimit the field in which “g” functions.
This delimitation has been made both qualitatively and
quantitatively.³ The apparent difference between Spearman and Kelley⁴
does not seem to be irreconcilable.⁵ Likewise Spearman’s theory
most closely agrees with the findings of neurology.⁶ And in our judgment
it offers the one explanation fitting most closely the phenomena
observed in the school room.⁷ One might say, therefore, that this
theory is most acceptable in the light of statistical evidence,
neurological evidence, and application.

1 (a) Spearman, C.: “The Nature of Intelligence and the Principles of
Cognition.” London, Macmillan and Co., Ltd., 1923.

(b) Spearman, C.: “The Abilities of Man.” Macmillan.

2 For a complete explanation of “g” the reader must refer to Spearman’s
“Abilities of Man” and also “The Nature of Intelligence and the Principles of
Cognition.” It may be said here that “g” is that factor common to the usual set of
mental tests. Spearman does not say what it is but suggests that it is “mental
energy.” As will be seen later “g” finds its greatest activity in relational thinking.

3 Spearman, C.: “The Nature of Intelligence and the Principles of Cognition.”
London, Macmillan and Co., Ltd., 1923.

4 Kelley, T. L.: “Crossroads in the Mind of Man: A Study of Differentiable
Mental Abilities.” Stanford University, California, Stanford University Press,
1928.

5 (a) Line, W.: Three Recent Attacks on Associationism. Journal of General
Psychology, Vol. V, October, 1931, pp. 495-513.

(b) Holzinger, K. O.: Tetrad Differences with Overlapping Variables.
Journal of Educational Psychology, Vol. XX, February, 1929, pp. 91-97.

6 Lashley, K. S.: “Brain Mechanisms and Intelligence.” Chicago, University
of Chicago Press, 1929, pp. 11.

7 Turney, A. H.: Intelligence Motivation and Achievement. Journal of
Educational Psychology, Vol. XXII, September, 1931, pp. 426-434.

The importance of Spearman’s theory to this discussion lies in the
fact that we have here a very definite and deliberate attempt at a
systematic delimitation of the field and a selection of items from that
field in accordance with a statistical technique which permits a
determination of the amount or proportion of the thing to be measured
(that is, “g”) which is present in the particular kind of tasks selected,
or the extent to which it correlates with that particular kind of task.
To illustrate, Holzinger presents the following correlations between
“g” and the various tests in the case of Bonser’s study.¹





Test                              Symbol   Correlation with “g”
1. Mathematical judgment          r1g      .707
2. Controlled association         r2g      .673
3. Literary interpretation        r3g      .605
4. [test name illegible]          r4g      .554
5. [test name illegible]          r5g      .398











Spearman has clearly pointed out that his doctrine has as one
of its chief utilities a direct application in the construction of mental
tests. “We are enabled to ascertain just the degree of accuracy
with which any given test, or series of tests, will measure either a
person’s ‘g’ or any of his S’s.”² In the construction of mental tests
the best weighted pool of separate tests can be determined. Thus
in the case of the above five tests Holzinger says that the best weighted
pool gives a correlation with “g” of .867. He has
presented other data indicating the possibility of much higher
correlations between a best weighted pool and “g.” The difference between
the process used by Spearman and the ordinary method of test
construction is, first, that in Spearman’s procedure items would
not be selected from a field having little or no correlation with “g,”
and second, that the correlation between a group of similar items,
such as mathematical judgment, and “g” can be determined. In
the older method there was no particular guarantee that the test
items were saturated with what the test purported to measure.
Asher’s report indicates that in practice good results are obtained
by the application of Spearman’s technique to the construction
of mental tests.³
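
The figure of .867 quoted above can be reproduced from the five
correlations with “g.” The sketch below is illustrative only. It assumes
the tests share nothing but “g” (that is, a single general factor with
uncorrelated specific factors), in which case the regression-weighted
pool correlates with “g” to the extent sqrt(S / (1 + S)), where S is the
sum of r² / (1 − r²) over the tests. The function names and the choice of
Python are ours and do not come from Holzinger or Spearman.

```python
import math

# Correlations of Bonser's five tests with "g", as quoted from Holzinger.
G_CORRELATIONS = [0.707, 0.673, 0.605, 0.554, 0.398]


def pool_weights(correlations):
    """Relative regression weights r / (1 - r^2) for each test.

    Assumption: one general factor, specific factors uncorrelated.
    """
    return [r / (1.0 - r * r) for r in correlations]


def pooled_correlation_with_g(correlations):
    """Correlation of the best weighted pool with g under the same assumption."""
    s = sum(r * r / (1.0 - r * r) for r in correlations)
    return math.sqrt(s / (1.0 + s))


pooled = pooled_correlation_with_g(G_CORRELATIONS)
print(round(pooled, 3))  # prints 0.867
```

Under this assumption a test more highly correlated with “g” receives a
heavier weight, so that adding a weakly saturated test raises the pooled
correlation only slightly.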





1 (a) Holzinger, Karl: “The Application of Spearman’s Methods to the
Construction of Intelligence Tests.” Report of the Conference on Individual
Psychological Differences. Appendix A-1, National Research Council, 1930, pp. 9.

(b) Spearman, C.: “The Nature of Intelligence and the Principles of
Cognition.” Pp. 147.

2 Spearman, C.: “The Abilities of Man.” Pp. 77.

3 Asher, E. J.: The Predictive Value of Mental Tests That Satisfy Spearman’s
Tetrad Criterion. Journal of Applied Psychology, Vol. XIII, April, 1929, pp.
152-158.
We believe that the pragmatic attitude of assuming that since
the tests work we need not bother about what they test is not
justifiable. In view of the imposing evidence favoring the Spearman theory
and its application such a position is no longer tenable. To apply our
solitary criterion of validity will necessarily force a recession from the
pragmatic attitude. We will have to unite upon a delimitation of the
field as Spearman has done, in order that the items may be selected
from the field to be measured. Not to do so will permit each tester
to set up his own definition though one vary from another as widely
as the poles. This would continue the old confusion of definition
and methods of validating, with a resultant possible retardation of the
advancement of measurement. It would mean that any definition
of “intelligence” will be sufficient, and that any sort of test will do as
long as it tests something which the maker of the test thinks is
“intelligence.”


II. APPLICATION TO ACHIEVEMENT TESTING


A. Critical Analyses of Existing Concepts.—The problem of validat- 
ing achievement tests has long been recognized as different from that 
of validating mental tests. The fact that definitions of validity and 
methods of validating peculiarly applicable to achievement testing 
have been set up shows this to be true. Nevertheless it is in the field
of achievement testing that this concept (that true validity is dependent
upon the selection of items from the field to be measured) probably
offers its greatest utility. In contrast to mental testing,
school achievement apparently does not involve a single general factor
determinable by a technique similar to the tetrad equation. Hence
more than in the case of mental tests the field must be delimited by
experts. Once the field is delimited, the selection of material for a
curriculum (and hence for testing) must then involve consideration
of the “quality” of the valid items. In this step, as will be seen later,
the concept of “g” may play an important rôle.
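
The tetrad equation referred to here can be illustrated briefly. As a
sketch, and assuming hypothetically that a set of tests shared nothing
but a single general factor, each intercorrelation would equal the
product of the two tests' correlations with that factor, and every
tetrad difference, such as r13·r24 − r23·r14, would vanish. The figures
used are the correlations with “g” quoted from Holzinger in Part I. The
function names are illustrative only.

```python
from itertools import combinations

# Correlations with "g" quoted from Holzinger for Bonser's five tests.
LOADINGS = [0.707, 0.673, 0.605, 0.554, 0.398]


def implied_correlation(i, j):
    """Intercorrelation of tests i and j if they share only one general factor.

    Assumption: r_ij = r_ig * r_jg (specific factors uncorrelated).
    """
    return LOADINGS[i] * LOADINGS[j]


def tetrad_differences(a, b, c, d):
    """The three tetrad differences for a set of four tests."""
    r = implied_correlation
    return (
        r(a, b) * r(c, d) - r(a, c) * r(b, d),
        r(a, b) * r(c, d) - r(a, d) * r(b, c),
        r(a, c) * r(b, d) - r(a, d) * r(b, c),
    )


# Largest absolute tetrad difference over every set of four tests.
largest = max(
    abs(t)
    for quad in combinations(range(len(LOADINGS)), 4)
    for t in tetrad_differences(*quad)
)
print(largest)  # zero apart from floating-point rounding
```

With observed intercorrelations in place of the implied ones, tetrad
differences departing from zero by more than sampling error would count
against a single general factor, which is the situation the text
supposes for school achievement.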

Our discussion of validity in relation to achievement tests can 
best be introduced by quoting and discussing some of the definitions 
and methods that have received attention. It must be recognized 
that any criticism made is for the purpose of elucidating our argument 
and not to disparage very able presentations. 

Ruch has listed several definitions of validity: 


1. Validity is the degree to which a test or examination measures what it is 
intended to measure. 
2. Validity is the general worthwhileness of an examination.

3. Validity refers to the care taken to incorporate in a test or examination 
those elements or items which are of prime importance, and to the pains taken to 
eliminate the non-essential. 


4. Validity is in general the degree to which a test parallels the curriculum and 
good teaching practice. 

5. Validity refers to the value of the test for measuring specific abilities in an
accurate fashion, and a test ceases to have validity when applied to the
measurement of abilities for which it was not intended.¹


The definitions quoted are not all in agreement. The first one is 
practically identical with the definition we are elucidating. The
second, the general worthwhileness of a test, may justifiably be considered
to include not only validity but also discrimination, reliability, and
objectivity. The third of these definitions is not, strictly speaking,
a definition of validity. Items may be of varying value considered
from different points of view, yet they may all be valid or none be 
valid. One could hardly say that the ability to extract square root 
was not an arithmetic ability, and therefore a test item measuring 
this phase of arithmetic ability would be valid. Yet its utility for 
the majority of pupils may be very slight. Just because an item 
seems to be non-essential when considered from a utilitarian stand- 
point or a social point of view by no means renders it invalid. In 
achievement testing particularly we have too long confused with 
validity an entirely separate aspect of the problem, namely the quality 
of valid items, or more broadly, the quality of different parts of a 
delimited field. 

The fourth definition is very nearly the same as our suggested
one. There is a possibility of confusion, however, unless one recognizes
that when good teaching practice requires the selection of material 
for teaching on the basis of utility, this selection is no longer strictly 
one of validity (as we have just indicated). However, good teaching 
practice will have to face the problem of what arithmetic ability is, 
and the relative importance of problem solving and reproduction 
in the child’s development regardless of immediate utility. This 
phase of the question of validity is of prime importance and will be 
treated in detail later. 

1 Ruch, G. M.: “The Objective or New Type Examination.” Scott, Foresman
& Co., 1929, pp. 27-28.

Definition number five is acceptable if the word specific refers
to a given delimited field. The last part of this definition is calculated
to throw into relief the major problem we are discussing. Without
delimitation of the field one can never know when the instrument is
measuring abilities for which it was not intended.

As methods of validating tests this writer cites the following:


1. By judgments of competent persons. 

2. By analysis of courses of study or text-books. 

3. By harmonizing with the recommendations of national educational com- 
mittees or other recognized bodies on curricula, courses of study, minimum essen- 
tials, etc. 

4. By experimental studies of social utility (such as the Horn and Thorndike 
studies of the most frequently used words, the Ashbaugh and Horn studies of 
spelling lists, the studies of Wilson, Woody, et al., on arithmetic needs of business, 
etc.). 

5. By studies of the most frequently recurring errors. 

6. By computation of the percentages of pupils answering each item cor- 
rectly at each successive age or grade level. 

7. By correlation against an outside criterion. 


To these may be added from Ruch and Stoddard:¹


8. Analysis of final examination questions. 
9. Use of rating scales in setting up criteria. 
10. Correlations with school marks or other measures of success. 
11. Differential scores shown by two groups known to be widely separated 
upon a scale of ability. 
12. Logical or psychological analysis. 
13. Correlations with tests of other intellectual, or non-intellectual, or educa- 
tional abilities. 


Of these, the majority are not true methods according to our 
single criterion. The first three may be interpreted as conforming 
very closely. 

The others may or may not conform. A brief characterization 
will clarify this point. We have already shown that an item or group 
of items may be perfectly valid and yet not satisfy any criterion of 
utility. Hence the fourth method is not a method of validation. 

Studies of the most frequently recurring errors, or studies of
difficulty by the method of percentage of failures or passes, are not
at all criteria of validity. They serve only to determine the
discriminative value of an item or of a test. No evidence of discrimination
or difficulty will serve to identify any given test as a test of ability
X. But once the field to be tested has been delimited, then selection
of items from that field will always identify the test as a test of ability
X, and consequently, as a valid test.

1 Ruch, G. M. and G. D. Stoddard: “Tests and Measurement in High School
Instruction.” World Book Co., Yonkers on Hudson, N. Y., pp. 302.

Correlations with outside criteria are not evidence of validity.
Such a procedure merely throws us back upon the outside criterion,
which itself must be validated. This validation ultimately is forced
back to the definition and criterion we have suggested. To use 
correlations with any outside criterion such as tests of intellectual, 
non-intellectual, or educational abilities (number 13 above) or school 
marks (number 10) does not establish validity. The correlations 
may be positive if these abilities happen to be necessary to successful 
achievement in the field being measured. 

Analysis of final examination questions would serve as a criterion 
of validity only if these questions tested the content of a previously 
delimited field. Logical or psychological analysis may be satisfactory 
if made in accordance with some guiding principle such as we have 
set up. 

There remains of the criteria quoted that of the use of rating scales
in setting up criteria. Rating scales may serve to facilitate the refining
of judgments of experts or the determination of the consensus of
judgments. So used they are desirable aids to the process of validation.
They are not criteria in themselves.

Two other measures sometimes erroneously used as evidence of 
validity are: 

1. The size of the standard deviation. 

2. Conformity to the normal curve. 

They, too, are properly methods of determining the discriminating 
capacity of a test or item and would serve equally well in any field of 
measurement. Hence they do not serve to prove that any given fact, 
concept, skill, or problem belongs to general science rather than to 
Latin. They have no value whatever for proving that such and such 
items included in a given scale for the measurement of chemistry 
belong or do not belong to the field of chemistry. 

It is evident that human judgment must in the last analysis
determine the field and delimit it. Validity is by its very nature
determinable by no other means, and the only statistical treatment which
is essential to the establishment of validity is that which will refine or
assist in the consensus of opinion of experts. Whether a given fact,
concept, or skill is a part of general science is not a question of utility,
statistical determination of “difficulty,” or correlation with other
abilities of students in a given field.
B. The Quality of Valid Items.—As has been indicated, once a
field has been delimited, selection of items from this field either for
teaching or for testing will involve consideration of quality, that is,
of utilitarian and developmental values. From the standpoint of
achievement testing these values are important. An item may
be a perfectly valid item of general science, physics, or Latin, and
yet not be justified for a high school course. It must be recognized,
however, that attempting to decide in advance questions of preferential
or relative value is difficult. If pupils pursued a natural attack upon
any field there would be no predicting precisely the direction their
activities would take at all times. Items seemingly having little social
value would take on importance because they were necessary to the
development of the pupil in that field. In the end the normal or spontaneous
development of the pupil may be more important than predetermined
utility.

In judging the quality of items one of the best guides is the concept
of “g.” It has great utility also in selecting the method of teaching,
and in indicating the nature of the testing. In the ordinary school
room situation it is supposed that the pupil will direct his mental
ability, his “g,” to the field in which he is working. Let us take
arithmetic. Traditionally arithmetic is supposed to involve reasoning
ability. It is supposed to involve that particular ability which we
symbolize by “g.” Arithmetic tests have been used, not by chance,
as parts of mental tests. Thus Army Alpha, Terman A, Otis, and
others include tasks of mathematical ability. It would seem then
that in the field of arithmetic the curriculum would be such as to
call into play “g.” The ability to see relationships ought to play a
large part in development in this field.

The practice seems to be otherwise. Arithmetic has become a
“drill” subject in which the ability to see relationships plays but a
small part. Instead the pupil who has a large amount of industry or
perseverance may be able to memorize the “combinations” more
thoroughly than one who has a considerable amount of “g,” likes to
use it, and hence is a poor “student” in the field of memorizing the
1200 or 1300 combinations. A scathing indictment of this method
of teaching arithmetic has recently appeared, presenting a point of
view similar to our own.¹

1 DeGrange, McQuilkin: Statisticians, Dull Children, and Psychologists.
Educational Administration and Supervision, Vol. XVII, Nov., 1931, pp. 561-573.
Arithmetic is not the only field in which the functioning of “g”
is very limited. In geography, language, literature, and history¹
it is possible to find curricula making but little demand upon the
intellect. Even when the content is not largely drill, it may still
require but a minor functioning of “g” because it has been made
trivial, piecemeal, and narrow. Thus the social sciences as frequently
taught in the high school largely ignore the fundamental
problems, the solution of which might call into play this ability
“g.” In fact one might seriously criticize much of the elementary and
high school curricula, as well as the tests of their mastery, because
they not only offer a much too limited opportunity for the exercise
of this function but actually hinder it. It is
entirely possible that a given high school “course,” and the material
accepted by the teacher as fitting into that course, will offer little or
no opportunity for the brighter half of the student body to use its
effective mental energy. In this connection one should consider the
three major laws that govern all the processes involved in neogenesis
or, as we may think of it, the “generating of new items in the field of
cognition.”² Briefly, “effective mental energy” or “g” finds its
greatest field of activity in relational thinking, using the term in its
broadest sense. The present tendency to emphasize minimum
essentials, job or work sheets, drill and more drill, only means that
the average and better than average child will soon have exhausted
the possibilities for using his mental energy and will either turn to
some other field than the curriculum, or not use it at all, or use it only
at a low level of potential.

Spearman has pointed out that reproduction, and the educing
of relations or correlates, may be entirely distinct. Once a concept
becomes functional through the action of “g,” future reproductions of
it may involve little or no “g.” Mere reproduction, therefore, may
correlate very low with “intelligence.”

It would be entirely possible to provide a curriculum which permitted
a fairly full exercise of “g” for the majority of all of the pupils,
yet this would not guarantee that the testing of achievement in that
field would show that an outstanding pupil had exercised “g.” Unless
the test is so constructed as to permit the display of differential
applications of “g” to the particular environment, course, or subject,
it would not reveal the facts. The application of “g” to this particular
environment might be displayed in two ways: first, the eduction of a
larger and larger number of relationships and correlates on a primary
level; second, the eduction of relations and correlates of higher and
higher orders. In the last concept probably lies the true explanation of
“difficulty.”

1 Jernegan, M. W.: The Colleges and Historical Research. Historical Outlook,
Vol. XVIII, March, 1927, pp. 105-107.

2 Holzinger, K.: Op. cit.

If a test of achievement in a given field embodies only items 
revealing the extent of relational thinking on the primary level, the 
number of these items must be very large and mirror a content that is 
extremely broad, before scores reflecting differential application of
“g” to this environment could result. It is also evident that to
reflect truly the application of individual differences in quantities of
“g” as possessed by different pupils, items representing educing of
relationships and correlates of higher and higher order should be
included. This would represent the inclusion of items of increasing
degrees of difficulty rather than the more or less undependable method 
of tabulating percentages of failure. In the latter case the item may 
not represent real difficulty at all since it might be one involving 
a primary relationship of a degree of difficulty within the reach of all 
if the experience were common. The determination of degrees of 
difficulty is admittedly difficult and may perhaps never be amenable 
to precise determination. Certainly it would seem far more desirable 
to determine the correlation between achievement in a given field and
“g” than to calculate percentage of failure on certain items for a mixed
population without regard to the relationship existing between success
or failure and amounts of “g.” In fact, the present tendency to
minimize mental ability in connection with school success is deplorable. 
The establishment of suitable curricula must eventually be based 
upon known differences in amounts of ‘‘g” in the population who are 
to pursue these curricula. 

It must be further recognized that difficulties in interpreting test
results will occur if the tester confuses testing for “g” with testing
for application of “g” within his field. It is here in all probability
that the greatest difference lies between the adherents of the
so-called “traditional examination” and those of the objective test. Two
difficulties present themselves in regard to the essay type of examination.
In the first place the situation set up by the examiner may in
reality not test what the student can do with his knowledge. The
essay examination may present a problem situation more dependent
upon the function of “g” alone than upon the extent to which “g”
has functioned in a particular field. In other words it may be a crude
intelligence test. Secondly, the essay question may be so worded
that relations and correlates of a primary order only will enter into
the answer.

It has been said that such men as Edison or Einstein would probably
not make good scores on the usual objective test. That may be
true, but it need not be true if the objective tests were constructed
in a manner to reveal the differential functioning of “g.” It is hardly
plausible that advances in the sciences or in any other field have been
made by persons whose application of “g” to these fields could not have
been revealed either by quantitative aspects of the extent of primary
eductions or of eductions of successively higher orders. The ultimate
utility of objective tests may depend upon our ability to construct
tests so that they satisfy psychological criteria to the fullest
extent. This is one of the best arguments for the professional training
of college teachers.

If one accepts as the best criterion of validity in achievement 
testing the selection of items from the field to be tested, the emphasis 
would seem to revert to the delimitation of the field, yet it does not 
follow that aims and methods are of no consequence, as shown above. 
One would have great difficulty in justifying courses of study and 
methods of teaching which place no premium upon the pupil’s effective 
mental energy. The progress of the race is dependent upon this par- 
ticular human ability more than upon any other, and that type of cur- 
riculum which fails to provide for its fullest utility, and that type 
of test which fails to reveal the effectiveness of the curriculum and 
method in this respect are open to criticism. This points to the
greatest danger of the so-called “mastery” technique. It too often
leads to monotonous repetition, and to “marking time” in the case of
the child well equipped with intellectual ability.

C. Some Special Problems.—To elucidate further the application 
of the concept we are presenting, let us consider it in connection with 
certain other criteria with which validity is most frequently confused. 

Applicability.—It is usual to speak of a given test as valid for the 
fifth grade. It were better to say that the “Jones” test is a valid 
test of arithmetic applicable to the fifth grade. When we speak 
of applicability we can think of a scale measuring a given trait theoret- 
ically from zero to perfect ability. A given test may then be applicable 
over all or part of such a range. The fact that it does or does not 
apply to any part of this range is not necessarily an aspect of validity 
as we have defined it. Many traits might be measurable, theoretically,
from zero to perfect with nearly identical phenomena of distribution 
but be entirely different traits. It seems advisable then to consider 
validity as separate from applicability. 

To say that inapplicability means that a test is not valid leads us 
into certain misinterpretations. To argue that a test containing 
one hundred items of general science, made for the ninth grade and 
given to a seventh grade, is not valid because it is not applicable 
would mean to say that an item is an item in general science only when 
it is known or likely to be known to a given group or individual to 
which it is presented. Hence, if item number 97 were missed by all 
it would not be an item in general science. For that matter, one could 
logically conclude that any item missed by anyone was not an item in 
general science so far as that individual was concerned. The absurdity 
of this position is self-evident. Such an interpretation of validity 
would prevent the establishment of any degree of discrimination in 
any test. Confusing validity and applicability leads also to confusion 
in interpretation when a test is administered to a group more able 
than the one for which it was intended. If an item normally encount- 
ered before the child leaves the first grade is placed in an arithmetic 
test for the fifth grade is it not an item in arithmetic? Let us suppose 
that we give a test in general science, applicable to the ninth grade, 
to a group of college juniors and seniors. Some will make a lower 
score than the norm for the lowest 10 per cent of high school freshmen.¹
They have forgotten many of the things they learned. Are these 
items no longer general science? And again, why were they taught 
if they were not to be remembered? It is evident that one needs 
to consider this factor in validating tests. Items may be selected 
from the field and taught which one does not expect the student to 
remember.² But some justification should exist at the time they are
placed in the curriculum for their presence. 

Discrimination.—Far more than in the case of applicability has 
this criterion been confused with validity. Thus a recent study 
has included evidences of discrimination as evidences of validity.³





1 We have data to this effect. 

2 Douglas, H. R.: “Modern Methods of High School Teaching.” Houghton
Mifflin, p. 21.

3 Sangren, Paul V.: Comparative Validity of Primary Intelligence Tests.
Journal of Applied Psychology, Vol. XIII, 1929, pp. 394-412.

One frequently finds that the size of the standard deviation of a test
is accepted as an evidence of discrimination. 

As a secondary criterion this is of great importance. After the 
items have been selected from the field to be tested this criterion 
should be applied to see how well the items discriminate between 
different levels of ability. But to use this as an evidence of validity 
might direct attention from the best criterion of validity. Theo-
retically, at least, it would be possible to plan a test in some subject
field, place no items in it from that field, and yet have a test which
will show admirable discriminatory powers in some field, but not the 
right one. Of course in practice this does not occur, but we feel that 
the acceptance of the criterion of discrimination as a criterion of 
validity has served to do just what has been indicated—to distract 
the attention from genuine criteria of validity. 

It has already been suggested (part I) that such evidence as the 
size of the standard deviation can not be, strictly speaking, evidence 
of validity. On the assumption that a given test measures a trait 
which takes the form of the normal curve, the standard deviation 
plus other data may serve to determine whether or not the discrimina- 
tory powers of that test for a given population are exact. If they 
are, the scores will assume the form of a normal curve. Yet this 
phenomenon is a dangerous one to interpret in regard to validity. 
To say, as has been said, that of two tests the one having the smaller
standard deviation is the more valid (or even discriminatory) is open
to question, for as the standard deviation approaches zero (assuming
that the tests could be so reworked as to bring this about) the less
discriminatory the tests become. It is obvious that unless one knows
exactly the form of the true distribution for a given group comparisons 
of standard deviations may lead to erroneous interpretations. 

To compare the form of an obtained curve with that of the “normal
distribution curve” by inspection is not evidence of validity. If only
one trait, say intelligence, were so distributed it might serve. But
since many traits are assumed to be so distributed, many curves would
be “normal,” hence their shape would not serve to identify the trait
being measured. Only by postulating that each trait has a peculiar 
curve could this method serve as a means of validating the test as a 
test of a given trait. 
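
The argument can be illustrated with a small simulation, offered only as a sketch. The two score sets below are hypothetical: they are drawn independently from the same normal distribution (mean 100, SD 15, values chosen purely for illustration) and stand in for two unrelated traits. Their spreads agree closely although the scores are essentially uncorrelated, so neither the standard deviation nor the shape of the curve tells us which trait was measured.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def sd(xs):
    """Population standard deviation of a list of scores."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (sd(xs) * sd(ys))


def simulate(n=10000, seed=0):
    """Draw two independent, normally distributed score sets (hypothetical traits)."""
    rng = random.Random(seed)
    trait_a = [rng.gauss(100, 15) for _ in range(n)]
    trait_b = [rng.gauss(100, 15) for _ in range(n)]
    return trait_a, trait_b


trait_a, trait_b = simulate()
print(round(sd(trait_a), 1), round(sd(trait_b), 1))  # nearly equal spreads
print(round(pearson_r(trait_a, trait_b), 2))         # close to zero
```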


SUMMARY 


This paper attempts to justify a single definition of validity and 
a single criterion for judging validity. The definition considered 
in this discussion is that “validity is that property of a test by virtue
of which it measures what it purports to measure.” The solitary 
criterion for judging validity is the selection of items from the field 
to be measured. A method for evaluating items selected from the 
field is already available in the case of mental measurement, viz., 
Spearman’s tetrad equation. In the case of achievement testing 
the delimitation must depend upon judgments of experts and the 
selection be based upon consideration of quality. Correlation with 
“g” is suggested as the best guide for judging quality.
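
For readers unfamiliar with the tetrad criterion, the following is a minimal sketch in its usual textbook form, which this paper does not spell out: for four tests a, b, p, q dependent on a single common factor, the tetrad difference r_ap·r_bq − r_aq·r_bp should vanish. The loadings and function names are hypothetical and serve only to show the arithmetic.

```python
def tetrad_difference(r, a, b, p, q):
    """Return r[a][p]*r[b][q] - r[a][q]*r[b][p] for a correlation table r."""
    return r[a][p] * r[b][q] - r[a][q] * r[b][p]


def one_factor_correlations(loadings):
    """Build the correlations r_ij = l_i * l_j implied by one common factor."""
    n = len(loadings)
    return [[1.0 if i == j else loadings[i] * loadings[j] for j in range(n)]
            for i in range(n)]


# Hypothetical loadings of four tests on a single general factor.
r = one_factor_correlations([0.9, 0.8, 0.7, 0.6])
print(abs(tetrad_difference(r, 0, 1, 2, 3)) < 1e-12)  # True
```

With empirical correlations the difference is never exactly zero, so in practice it would have to be judged against its sampling error.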

Other definitions than the one suggested, and other methods of
validating than the one suggested distract from a scrutiny of the field
to be measured, both with regard to extent and structure. The use
of the suggested definition and criterion would serve not only to place 
the emphasis upon the field where it belongs, but would, in the case of 
achievement testing, bring about a more thorough study of the char- 
acter of the test with regard to the nature of the field it measures 
and the quality of the items in the field. The inclusion of other 
definitions and methods of validity in test construction acts to divert
attention from the field itself, and this in turn militates against 
progress in test construction and interpretation of results. 


EVALUATION OF PHOTOGRAPHIC MEASURES OF
READING¹


MILES A. TINKER 
University of Minnesota 


AND 


ARDEN FRANDSEN 
University of Utah 


Since eye-movement records have now found a wide application 
as measures of reading performance, a more precise understanding of 
what each measure means is perhaps desirable. Four scores are 
available from eye-movement records: (1) Fixation frequency; (2) 
pause duration; (3) perception time which is defined as the sum of 
pause durations as there is no perception during eye movements; 
and (4) regression frequency or the pauses following backward eye 
movements which occur in reading a line of print. There is abundant 
evidence of non-discriminative use of the different scores. It is 
frequently implied that the measures all mean the same, or nearly the 
same thing. 
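
The relation among the four scores can be sketched as follows. The record layout is hypothetical (a list of pauses, each with a duration in milliseconds and a flag for whether it follows a backward movement within a line), and taking pause duration as the mean pause length is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fixation:
    duration_ms: int         # length of the pause, in milliseconds
    after_regression: bool   # True if the pause follows a backward movement in a line


def photographic_scores(fixations):
    """Derive the four scores from one reader's record of pauses."""
    n = len(fixations)
    perception_time = sum(f.duration_ms for f in fixations)  # sum of pause durations
    return {
        "fixation_frequency": n,
        "pause_duration": perception_time / n if n else 0.0,  # mean pause (assumed)
        "perception_time": perception_time,
        "regression_frequency": sum(1 for f in fixations if f.after_regression),
    }


record = [Fixation(250, False), Fixation(200, False),
          Fixation(300, True), Fixation(250, False)]
print(photographic_scores(record))
```

Under this definition perception time is the product of fixation frequency and mean pause duration, which is one reason the measures need not all carry the same meaning.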

The purpose of this report is to present evidence from a variety 
of reading situations that will help to evaluate more adequately the 
significance of each of the four photographic measures of eye move- 
ments in reading. 

There is rather general agreement that reading ability is composed 
of at least two elements, i.e., speed and comprehension. Tinker²
has pointed out that, when adequate methods of experimentation 
have been employed, all available evidence indicates a close relation- 
ship between speed and comprehension in reading. In eye movement 
studies, perception time is practically a pure measure of reading speed. 
It represents approximately ninety-four per cent of the reading time.³
This percentage is very constant in most reading situations. It 
follows, then, that perception time may be employed as a criterion of 





1 The material presented by M. A. Tinker was taken from a major study sub- 
sidized by the Graduate School of the University of Minnesota. The writers are 
indebted to Fred S. Beers and Oscar F. Litterer for two groups of the data used in
this study. 

2 Tinker, M. A.: The relation of speed to comprehension in reading. School and 
Society, Vol. XXXVI, 1932, pp. 1-3. 

3 Tinker, M. A.: Eye movement duration, pause duration, and reading time.
Psychol. Rev., Vol. XXXV, 1928, pp. 385-397. 
speed, and presumably also of comprehension in reading to the extent
that the two are correlated. The other photographic measures may 
be compared with perception time by the correlation technique. 

Using this method of comparison, Tinker¹ found that in reading
easy prose, fixation frequency was a very good measure of reading 
speed. Pause duration and regression frequency were somewhat 
less satisfactory. In scientific prose, pause duration correlated very 
low with perception time. Eurich² found a higher correlation than
Tinker between fixation and regression frequencies, but otherwise 
there was close agreement in the results of the studies. 

Except for the work done with children! these earlier comparisons 
involved few subjects. In both investigations a limited variety of
reading situations was employed. These shortcomings are eliminated
in the present investigation. 

The corneal reflection method of photographing eye movements 
while reading was used. The material read consisted of from ten to 
forty lines of easy narrative prose and of easy and difficult scientific 
prose, five paragraphs from a speed of reading test, and five types of 
objective examination questions. There were five questions of each 
type. The subjects were university students, fifty to two hundred 
sixteen per group. 

The reliabilities of the measures, computed by the split-half 
method and corrected for the full length of the selection, were high 
enough in all cases to justify group comparisons. They ranged from 
.55 to .93 and the median coefficients were: Perception time, .82; 
fixation frequency, .81; pause duration, .85; and regression frequency, 
.83.
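
The correction of a split-half coefficient to full length is conventionally made with the Spearman-Brown formula. The sketch below assumes that formula was the one applied here, which the text does not state; the function names are illustrative only.

```python
def spearman_brown(r_half, length_factor=2.0):
    """Estimate reliability of a test lengthened by length_factor.

    With length_factor=2 this steps a split-half correlation up to the
    reliability of the full-length selection.
    """
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)


def implied_half_correlation(r_full):
    """Invert the doubling case: the half-test correlation behind r_full."""
    return r_full / (2.0 - r_full)


print(round(spearman_brown(0.70), 2))            # 0.82
print(round(implied_half_correlation(0.82), 2))  # 0.69
```

On this assumption, a corrected coefficient of .82 would correspond to a correlation of roughly .69 or .70 between the two halves.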

The intercorrelations are shown in Table IA and Table IB. Exam- 
ination of these results reveals the following trends: (1) Fixation 
frequency quite consistently correlates high with perception time. 
This is true in the highly specialized as well as in the general reading 
situations. The two correlations of .73 and .74 probably reflect 
the slightly lower reliabilities for those readings. (2) Pause duration 
shows only a slight to a moderate sized correlation with perception 
time. When special reading situations are involved this intercorrelation
is highly variable from group to group (.21 to .74). A qualitative
analysis of the records indicates that a few disproportionately long
pauses at the end of true-false, multiple-choice, and especially com-
pletion items, resulting of course in longer perception times, are partly
responsible for the higher coefficients. (3) Regression frequency
yields a fair correlation with perception time. Although not a good
measure of speed of reading, it is somewhat better and more consistent
than pause duration. (4) Pause duration is not [a measure] of either
fixation frequency or of regression frequency. [(5) Regression] fre-
quency correlates rather high with fixation [frequency. This] correla-
tion is considerably higher in the special reading situations than in
prose, due evidently to the greater proportion of regressions in reading
the objective questions.

1 Tinker, M. A.: Photographic measures of reading ability. J. Educ. Psychol.,
Vol. XX, 1929, pp. 184-191.

2 Eurich, A. C.: Additional data on the reliability and validity of photographic
eye-movement records. J. Educ. Psychol., Vol. XXIV, 1933, pp. 380-384.


TABLE IA. PHOTOGRAPHIC MEASURES OF READING COMPARED

Group and       N     Measures             Fixation      Pause         Regression
material              compared             frequency     duration      frequency

I (L)¹          71    Perception time       .78 ± .03     .62 ± .06     .58 ± .05
easy                  Fixation frequency    .........     .02 ± .08     .71 ± .04
narrative.            Pause duration        .........     .........     .07 ± .08

II (L)¹         76    Perception time       .89 ± .02     .72 ± .04     .58 ± .05
scientific            Fixation frequency    .........     .25 ± .07     .79 ± .03
prose.                Pause duration        .........     .........     .11 ± .08

III (T)¹        77    Perception time       .82 ± .03     .47 ± .06     .59 ± .05
easy                  Fixation frequency    .........     .09 ± .08     .72 ± .04
narrative.            Pause duration        .........     .........    -.12 ± .08

IV (T)¹         77    Perception time       .74 ± .03     .52 ± .06     .62 ± .05
scientific            Fixation frequency    .........    -.17 ± .08     .79 ± .03
prose.                Pause duration        .........     .........    -.07 ± .08

V (B)           216   Perception time       .86 ± .01     .48 ± .04     .60 ± .03
reading               Fixation frequency    .........     .03 ± .05     .71 ± .02
test.                 Pause duration        .........     .........    -.07 ± .05

1 L = Litterer’s data, T = Tinker’s data, B = Beers’ data.
[The sign of the group II coefficient between pause duration and
regression frequency is illegible in the source.]

From our analysis of performance in reading various kinds of 
material by several groups of subjects we may draw the following 
conclusions concerning photographic measures of reading: (1) Both 
perception time and fixation frequency are closely related to each 
other and consequently are highly satisfactory measures of reading
ability as here defined. (2) Regression frequency is moderately
correlated with speed and hence is only a fair measure of reading
ability. (3) Pause duration, because of variable correlations with
[perception time, is a relatively] poor measure of reading ability,
particularly in special reading situations.


TABLE IB. PHOTOGRAPHIC MEASURES COMPARED IN READING PROSE AND
OBJECTIVE QUESTIONS

Group and       N     Measures             Fixation      Pause         Regression
material              compared             frequency     duration      frequency

VI (F)¹         66    Perception time       .88 ± .02     .63 ± .05     .64 ± .05
scientific            Fixation frequency    .........     .22 ± .08     .79 ± .03
prose.                Pause duration        .........     .........     .02 ± .08

VII (F)         65    Perception time       .92 ± .02     .21 ± .08     .88 ± .02
analogy.              Fixation frequency    .........    -.16 ± .08     .91 ± .02
                      Pause duration        .........     .........    -.12 ± .08

VIII (F)        66    Perception time       .85 ± .02     .54 ± .06     .69 ± .04
multiple              Fixation frequency    .........     .09 ± .08     .88 ± .02
choice.               Pause duration        .........     .........    -.05 ± .08

IX (F)          66    Perception time       .73 ± .04     .74 ± .04     .55 ± .06
recall:               Fixation frequency    .........     .12 ± .08     .87 ± .02
completion.           Pause duration        .........     .........    -.01 ± .08

X (F)           65    Perception time       .83 ± .02     .69 ± .04     .62 ± .05
true-                 Fixation frequency    .........     .20 ± .08     .86 ± .02
false.                Pause duration        .........     .........     .03 ± .08

XI (F)          50    Perception time       .88 ± .02     .26 ± .09     .70 ± .05
wrong                 Fixation frequency    .........    -.20 ± .09     .85 ± .03
word.                 Pause duration        .........     .........    -.29 ± .09

1 F = Frandsen’s data.

Many pitfalls of interpretation, and unconvincing conclusions may 
be avoided by a more careful consideration of what each photographic 
measure of reading signifies. Writers should not infer that pause 
duration and regression frequency measure the same reading function 
as perception time, or pause duration the same as fixation frequency. 
Both pause duration and regressions are most useful in analyses of 
oculomotor patterns which are characteristic of particular reading
situations. In certain situations an increase in pause duration and 
regression frequency may even indicate an increase in reading effi- 
ciency. These oculomotor patterns are markedly influenced by the 
reading attitude which is determined by instructions to the reader and 
by the kind of material read. 


THE SELECTION OF THE INTELLIGENCE QUOTIENT
DIVISOR FOR CLINICAL CASES BETWEEN
FOURTEEN AND NINETEEN YEARS OF AGE*


MITCHELL E. RAPPAPORT


Rochester, New York 


Psychologists in community child guidance clinics are frequently 
called upon to suggest a life plan for wards of the community when 
they reach the age of sixteen, or for juvenile delinquents, most of 
them between the ages of fourteen and sixteen. The findings of the 
psychological examination are useful in formulating the plan, but 
frequently the usefulness of the Stanford-Binet intelligence quotient 
may be impaired by its ambiguity when it is expressed—as has been 
the case in the Child Study Department of the Rochester Society for 
the Prevention of Cruelty to Children, where this investigation was 
made—as IQ based on a divisor of fourteen years and also as based on 
chronological age up to sixteen years. This procedure has been 
justifiably followed because of the divergence of excellent opinion on 
the question of what is the most suitable chronological age base to use 
in computing the intelligence quotient on the Stanford-Binet of 
individuals beyond fourteen years of age. The present paper is an 
attempt to answer that question on the basis of actual clinical cases 
as they are seen in what is considered a typical community clinic. 
The psychologist has tested an individual of about sixteen years 
on a Stanford-Binet and finds that there is a range of ten points 
between IQ based on fourteen and IQ based on sixteen. Which of 
these two quotients is more likely to give a true picture of the indi- 
vidual’s intelligence? The problem is here attacked on purely prag- 
matic grounds and involves no attempt at answering the vexing 
questions of constancy of IQ, variability of rate of growth, age of 
maturation, and effect of environment on IQ.†





* The author is deeply indebted for invaluable criticism and assistance to the 
following members of the staff of the Child Study Department of the Rochester 
(New York) Society for the Prevention of Cruelty to Children: Dr. Carl R. Rogers,
Mr. Gordon Riley, Miss Ruth P. Montgomery and Dr. Margaret B. Barker. Miss 
A. Leila Martin, director of the Child Study Department of the Rochester Board of 
Education, also gave valuable aid. 

† The unsettled nature of these problems was pointedly demonstrated at the
National Research Council’s Conference on Individual Psychological Differences 
in Washington in 1930. In the report was found: 
To express the real nature of the problem, let us consider a few
cases in which the two quotients give different classifications. Case
A is an eighteen year old girl with a mental age of eleven years four
months. Her IQ, then, ranges between seventy-one and eighty-one, 
according as it is based on sixteen or fourteen, and she may be either 
a low borderline or very dull girl. Case B is a sixteen year old boy 
with a mental age of thirteen years four months and an IQ between
eighty-three and ninety-five. He can be classified as dull or as
normal. Case C is a sixteen year old girl with a mental age of sixteen
years five months and an IQ between one hundred three and one 
hundred seventeen. She may be either normal or superior.* In 
each of these cases there had been examinations before the age of 
fourteen, and, assuming that the initial examination yielded a fairly 
accurate estimate of the individual’s intelligence, the early examina- 
tions indicate that in these cases the later IQ based on fourteen is 
probably misleading. Other cases indicated that the IQ based on 
sixteen was likely to be less accurate than the IQ based on fourteen, 
the criterion of accuracy again being correspondence with the earlier
IQ.
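
The arithmetic behind the two quotients can be sketched as follows. The function is illustrative only: it takes the IQ as one hundred times mental age over a divisor, where the divisor is chronological age capped at a stated maximum, and it rounds to the nearest whole point (the rounding rule is an assumption). The ages for Cases A, B, and C are those given above, with "sixteen year old" taken as exactly sixteen years.

```python
def months(years, extra_months=0):
    """Convert an age in years and months to months."""
    return years * 12 + extra_months


def iq(mental_age_months, chronological_age_months, max_divisor_years):
    """IQ with chronological age capped at max_divisor_years as the divisor."""
    divisor = min(chronological_age_months, months(max_divisor_years))
    return round(100 * mental_age_months / divisor)


cases = {
    "A": (months(11, 4), months(18)),
    "B": (months(13, 4), months(16)),
    "C": (months(16, 5), months(16)),
}
for name, (ma, ca) in cases.items():
    print(name, iq(ma, ca, 16), iq(ma, ca, 14))
# A 71 81
# B 83 95
# C 103 117
```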

PROCEDURE


From the records of cases with which the Child Study Department 
of the Rochester S.P.C.C. had dealt, there were found one hundred
fifty cases where the individual had been given a Stanford-Binet 
before the age of fourteen and after that age. The only purposeful 
selective factor was the element of emotional upset. Where the 
subject was obviously upset and had been noted so by the examiner, 
the case was ruled out. All other cases (one hundred fifty) were 
included. It must be pointed out, however, that the Child Study 





Freeman on Mental Indices (Appendix A, p. 1): “The IQ has been shown to
be fairly constant for the revisions of the Binet scale.” 

Kuhlmann, commenting on Freeman’s paper (p. 9): “I have not been able,
as some of you know, to find it (the IQ) constant, but rather that it has a general 
tendency to vary, the amount depending on the initial IQ and on age.” 

Goodenough, reporting on Values and Weaknesses of Present Scales for the 
Measurement of Intelligence, asserted that the form of the mental growth curve is 
unknown. (Appendix B, p. 13.) Kelly subscribed to that statement (p. 12) and 
Peterson pointed out that the straight line of mental growth, judged from a test, 
is liable to be based on selected and otherwise inadequate data (p. 14). 

* These classifications follow the suggestions in Terman, The Measurement of 
Intelligence, p. 79. 
Department of the Board of Education cooperates with the S.P.C.C.
clinic, and it was found necessary, in order to get a group as large as 
one hundred fifty, to include cases where the initial examination may 
have been given at the Board of Education and the post-fourteen 
year examination by the S.P.C.C. and vice versa—that is, cases where 
the initial examination was given by the Child Study Department 
of the S.P.C.C. and the post-fourteen year examination by the Board
of Education. In only about half the cases were both examinations
given by the S.P.C.C. clinic, but because both clinics permit testing
by experienced clinical examiners only, it was felt justifiable to include 
tests given at both agencies. Further, it was desired to limit the 
study to community clinic cases, and for that reason no attempt was 
made to study cases in which the Board of Education—conducting 
primarily an educational clinic—had made both examinations. All 
examinations in this study were Stanford-Binets. Some individuals 
were given more than two examinations, but only the first and last 
have been compared, except in cases where the first examination 
occurred before the age of eight. 

The mean age of the group at the time of initial test was 134.63 
months or eleven years three months. The mean age at the time of 
last retest was 186.19 months or fifteen years six months. The mean 
elapsed time between initial test and retest was 51.57 months or four 
years four months. At the time of last retest, forty-four individuals 
were between the ages of fourteen years one month and fourteen years 
eleven months, seventy-four were between fifteen years no months 
and fifteen years eleven months, and thirty-two were sixteen years or 
over. The mean IQ of the group on the initial Stanford-Binet was 
79.62 with a standard deviation of 12.44. 

The group consists of two large classes. The first, constituting 
fifty-three per cent of the total, is made up of children who had been 
or still were in foster homes at the time of the last retest. Of these 
it may be said that environmental improvement had occurred between 
the initial test and the retest, for most of them had been placed in 
foster homes by court action on charges of insufficient or improper 
guardianship. The initial test for most of them was made at the 
time of court action. The second class includes individuals who had 
undergone no beneficial environmental change previous to the last 
examination. This group consists chiefly of fifty-two individuals 
who were referred by court for study; they were charged with juvenile 
delinquency. Court also referred five cases on charges of insufficient 
guardianship where the children were beyond fourteen years of age
and three were referred for study in connection with the question of 
mental defect. There was also a scattering of private cases. In 
brief, this second class might be described as largely a behavior 
problem group where there had been no foster home care or care 
outside the child’s own home previous to the last examination. 


RESULTS 


In handling the data it was felt that two procedures might be of 
value. The first was to make comparisons using all cases but chang- 
ing the IQ divisor. Thus, deviations between tests were found using 
only fourteen years as the divisor for the retest IQ, then using chrono- 
logical age for cases up to fifteen and fifteen for cases beyond, and 
finally using chronological age up to sixteen and sixteen for cases 
beyond. In this manner the entire group was included in each of 
the comparisons. A second procedure emphasized age at time of 
retest. Thus, deviations between tests were found for those whose 
retest occurred between fourteen years one month and fourteen years 
eleven months, for those between fifteen years and fifteen years 
eleven months, and for those sixteen years and beyond. Within 
each of these age groups comparisons were made between IQ’s on 
initial test and retest, the retest being computed using chronological 
age, fourteen years, fifteen years, and sixteen years as the base. 

What change is found in the mean IQ of the group between initial 
test and retest? The mean initial IQ was 79.62 (SD equals 12.44). 
When the retest is computed with fourteen years as the divisor, the 
mean retest IQ is 88.80 (SD 15.13), an increase in mean IQ of 9.18 
points. When the retest IQ is computed with a divisor using chrono- 
logical age up to fifteen years and fifteen years for cases older than 
fifteen, the mean IQ becomes 80.21 (SD 13.68), and the change in 
mean IQ falls to an increase of 0.59. Finally, when the retest IQ is 
computed with a divisor using chronological age up to sixteen years 
and sixteen for cases beyond that age level, the mean IQ becomes 
77.91 (SD 14.00), and the mean IQ change becomes a decrease of 
1.71 points. These data are presented in Table I. 

Deviations between initial and retest IQ’s were derived, changing 
the conditions of the IQ divisor as indicated in the preceding para- 
graph. When fourteen is used as the IQ divisor, the retest IQ exceeds 
the initial IQ in one hundred eight cases and falls below the initial 
IQ in thirty-seven. In five instances there is no change. The aver-
age positive deviation is 9.38; the average negative deviation is
somewhat more than half that, 5.30; and the average arithmetic
deviation almost doubles that found in most retest studies, it being
in this case 8.06.


TABLE I. MEAN IQ AND VARIABILITY OF INITIAL TEST AND RETEST, ACCORDING
TO IQ DIVISOR ON RETEST. N = 150

                                          Mean IQ     SD       Change in
                                                               mean IQ
Initial Stanford-Binet                     79.62     12.44
Retest, 14 as divisor                      88.80     15.13     plus 9.18
Retest, CA to 15 and 15 as divisor         80.21     13.68     plus 0.59
Retest, CA to 16 and 16 as divisor         77.91     14.00     minus 1.71

When chronological age up to fifteen and fifteen are used as the 
IQ divisor, retest IQ’s exceed initial IQ’s in seventy-four and fall 
below in sixty-five cases; in eleven instances there is no change. Average
positive and negative changes are roughly equivalent—average 
positive deviation being 6.65 and average negative deviation being 
6.57. The average arithmetic change is 6.61. 

Negative changes exceed positive changes when the IQ base 
becomes chronological age to sixteen and sixteen for older cases. Here 
we find the IQ increasing on retest in fifty-seven cases while it decreases 
in eighty-five. The average increase is 5.62 and the average decrease 
7.15. The average arithmetic change is the lowest of the three 
groups—6.15. In eight cases there is no change. Table II presents 
the data on deviations with changing IQ divisors. 

The criterion of best fit was also thought to be of value in throwing 
light on the suitability of a particular IQ divisor. Initial test and
retest were compared, the retest being considered with varying IQ’s 
as a result of changing the IQ base. The divisor, within the limits 
already presented, which yielded a retest IQ most nearly similar to 
the initial IQ received a tally. In some instances both fourteen and 
CA to fifteen, or CA to fifteen and CA to sixteen, may have been 
tallied, for both may have given best fits, one yielding a positive 
deviation arithmetically equivalent to a negative deviation yielded 
by the other. 
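
The tallying rule just described can be sketched as below. The three cases are hypothetical and the labels are illustrative; a tie is counted for every divisor that shares the smallest absolute deviation. The second function expresses each tally as a share of all tallies. Applied to the tallies the paper goes on to report (54, 56, and 65, which sum to 175 rather than 150), that calculation gives 31, 32, and 37 per cent, which appears consistent with ties having been counted more than once.

```python
def best_fit_tally(cases):
    """Tally, per divisor, how often its retest IQ lies closest to the initial IQ.

    cases: iterable of (initial_iq, {divisor_label: retest_iq}).
    Divisors that tie for the smallest absolute deviation are each tallied.
    """
    tally = {}
    for initial_iq, retests in cases:
        deviations = {label: abs(r - initial_iq) for label, r in retests.items()}
        smallest = min(deviations.values())
        for label, dev in deviations.items():
            tally.setdefault(label, 0)
            if dev == smallest:
                tally[label] += 1
    return tally


def percent_shares(tally):
    """Express each tally as a rounded percentage of all tallies."""
    total = sum(tally.values())
    return {label: round(100 * n / total) for label, n in tally.items()}


# Hypothetical cases; the third is a tie between the last two divisors.
cases = [
    (80, {"14": 88, "CA to 15": 82, "CA to 16": 77}),
    (90, {"14": 91, "CA to 15": 86, "CA to 16": 84}),
    (75, {"14": 79, "CA to 15": 77, "CA to 16": 73}),
]
print(best_fit_tally(cases))   # {'14': 1, 'CA to 15': 2, 'CA to 16': 1}
print(percent_shares({"14": 54, "CA to 15": 56, "CA to 16": 65}))
# {'14': 31, 'CA to 15': 32, 'CA to 16': 37}
```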

The method of best fit again shows the superiority of sixteen years 
as a base and the inferiority of fourteen. When fourteen is used as 
the retest IQ divisor, the IQ gives fifty-four or thirty-one per cent best fits.
Using CA to fifteen and fifteen, the retest IQ divisor fits best in fifty-
six cases, or thirty-two per cent. When sixteen is used as the retest
IQ base, there are sixty-five best fits, or thirty-seven per cent. The
best fit data also occur in Table II.


TABLE II. BEST FITS AND DEVIATIONS OF IQ’S BETWEEN INITIAL TEST AND
RETEST, CONSIDERING ALL CASES, BUT CHANGING THE MAXIMUM IQ
DIVISOR

                                              IQ divisors
                                    14       CA to 15     CA to 16
                                             and 15       and 16
Best fits, N                        54          56           65
Best fits, per cent                 31          32           37
Positive deviations, N             108          74           57
Negative deviations, N              37          65           85
Mean positive deviation           9.38        6.65         5.62
Mean negative deviation           5.30        6.57         7.15
Mean arithmetic deviation         8.06        6.61         6.15
No change, N                         5          11            8

Correlation coefficients were computed for the correlation between 
initial test and retest, retest IQ’s varying. These coefficients of 
correlation were found: 

1. Correlation between initial IQ and IQ on retest when fourteen 
is used as retest IQ divisor: 


r = +.82 ± .018


2. Correlation between initial IQ and IQ on retest when CA to 
fifteen and fifteen are used as IQ divisor: 


r = +.88 ± .012


3. Correlation between initial IQ and IQ on retest when CA to 
sixteen and sixteen are used as IQ divisor: 


r = +.84 ± .016
Considering the nature of the group, these correlations compare
favorably with those found in other studies of retests. This is espe- 
cially true of the correlation between tests when the final test is based 
on an IQ divisor with fifteen as the limit. 
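
The figure attached to each coefficient above is not labeled in the text. The sketch below assumes it is the probable error of r in the form customary at the time, 0.6745(1 − r²)/√N, with N = 150. That reading is an assumption, although it reproduces the three reported values.

```python
import math


def probable_error_r(r, n):
    """Probable error of a correlation coefficient r based on n cases."""
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)


for r in (0.82, 0.88, 0.84):
    print(r, round(probable_error_r(r, 150), 3))
# 0.82 0.018
# 0.88 0.012
# 0.84 0.016
```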

The second method of manipulating the data involved the breaking 
up of the group into classes according to age at retest and making 
comparisons between tests with changing IQ divisors within each age 
classification. There were forty-four cases in which the individual 
was between fourteen years one month and fourteen years eleven 
months at the time of retest; seventy-four cases between the ages of 
fifteen years and fifteen years eleven months, and thirty-two cases 
sixteen years of age and over at the time of retest. 

Within the fourteen years one month to fourteen years eleven 
months group the use of chronological age as divisor yields results 
more nearly similar to the initial IQ than does the use of fourteen 
years as the IQ divisor. The use of chronological age as divisor yields
twenty-four positive changes with an average positive change of 
6.00 and twenty negative changes with an average negative change 
of 7.15. The mean arithmetic change is 6.52. When fourteen is 
used as the divisor, there is a tendency for the retest IQ to exceed the 
initial IQ. Thus, there are twenty-six positive changes and sixteen 
negative changes. (There are two cases without change.) The 
average positive deviation is 8.96 and the average negative deviation 
5.25. The average arithmetic change is greater than when chrono- 
logical age is used as the IQ divisor, being 7.20. 
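
The mean arithmetic change for this age group can be recovered from the counts and mean deviations just given. The sketch assumes that cases without change are included in the denominator. That assumption is not stated in the text, but it reproduces both figures reported for the fourteen-year group.

```python
def mean_arithmetic_change(n_pos, mean_pos, n_neg, mean_neg, n_zero=0):
    """Mean absolute IQ change over all cases, counting no-change cases as zero."""
    total_cases = n_pos + n_neg + n_zero
    return (n_pos * mean_pos + n_neg * mean_neg) / total_cases


# Chronological age as divisor: 24 gains averaging 6.00, 20 losses averaging 7.15.
print(round(mean_arithmetic_change(24, 6.00, 20, 7.15), 2))     # 6.52
# Fourteen as divisor: 26 gains of 8.96, 16 losses of 5.25, 2 unchanged.
print(round(mean_arithmetic_change(26, 8.96, 16, 5.25, 2), 2))  # 7.2
```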

Within the fifteen years no months to fifteen years eleven months 
group it was possible to make comparisons using chronological age, 
fourteen years, and fifteen years as IQ divisors. IQ’s derived in these 
ways were compared with initial IQ. Again it is found that the 
intelligence rating using fourteen years as IQ divisor tends to overrate 
the individual. When fourteen is used, there are fifty-seven positive 
changes and only fourteen negative changes. (There are three cases 
without change.) The average positive change is 8.86, the average 
negative change 5.43, and the average arithmetic change 7.85. Using 
chronological age as the IQ divisor for this group tends to yield results 
below those on the initial examination. There are twenty-four posi- 
tive changes compared with forty-four negative changes (six cases with 
no change). The mean positive change is 5.25 and the mean negative
change 7.55. The mean arithmetic change using chronological age
is 6.19.

The use of fifteen years as the IQ divisor within the fifteen to sixteen
year old group seems to give results most nearly comparable to
the initial rating. Positive and negative changes are nearly equal in 
number and in size. There are thirty-two positive changes and 
thirty-four negative changes (eight cases without change). The 
mean positive change is 6.47 and the mean negative change is 6.21. 
The arithmetic change is lower using fifteen as divisor than it is when
chronological age or fourteen is used as divisor. Fifteen yields
a mean arithmetic change of 5.65.

For the thirty-two cases in the group sixteen years of age and 
older, IQ comparisons were made between initial test and retest with 
retest IQ’s computed with fourteen, fifteen, and sixteen as divisors. 
In this group fourteen is definitely unsuited. Its use yields twenty- 
five positive changes of an average of 10.92 points and only seven 
minus changes of an average of 3.86. The average arithmetic change, 
using fourteen, is 9.38. The use of fifteen also tends to yield higher 
IQ’s than those secured on the initial examination. There are, using
fifteen, nineteen positive changes averaging 7.21 points and ten 
negative changes averaging 6.50 points. The average arithmetic 
change is 6.31. There are three cases without change. As an IQ 
divisor, sixteen tends to yield more negative than positive changes— 
eleven positive changes to twenty negative changes. The mean 
positive change is 4.90, the mean negative change 6.75, and the mean 
arithmetic change 5.91. These facts are presented in Table III. 
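The figures quoted above suggest that the "mean arithmetic change" is the mean absolute change taken over every case in an age group, cases without change included. That reading is inferred from the reported numbers and is not stated by the author; the function name below is illustrative only. A minimal sketch under that assumption:

```python
def mean_arithmetic_change(n_pos, mean_pos, n_neg, mean_neg, n_zero=0):
    """Mean absolute IQ change over all cases in a group.

    Assumption (inferred, not stated in the text): cases with no
    change count in the denominator.
    """
    total_cases = n_pos + n_neg + n_zero
    total_abs_change = n_pos * mean_pos + n_neg * mean_neg
    return total_abs_change / total_cases


# Group aged 14-1 to 14-11 at retest (N = 44), figures from the text.
ca_divisor = mean_arithmetic_change(24, 6.00, 20, 7.15)
fourteen_divisor = mean_arithmetic_change(26, 8.96, 16, 5.25, n_zero=2)

print(round(ca_divisor, 2), round(fourteen_divisor, 2))
```

Under this reading the two values agree with the 6.52 and 7.20 reported for the group, to within rounding of the published means.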


PROBLEMS IN INTERPRETATION 


There are several rather important questions which are involved 
in the interpretation of the results presented above. These questions 
involve the constancy of the intelligence quotient, the variability 
of rate of growth of intelligence, the period of maturation of intel- 
lectual function, and the effect of foster home care—or improved 
environment—on native intelligence as measured by the intelligence 
quotient. Unfortunately, none of these questions seems conclusively 
to have been answered. 

Conflicting opinions are found and conflicting data are presented 
in connection with the constancy of the IQ. Among others, Dear- 
born,’ Garrison,’ Poull,’* and Henmon and Burns!” have reported con- 
stant I1Q’s. Freeman® has subscribed to the constancy of the IQ and 
has also reported® a tendency for the IQ to fall. Brown,? in reporting 
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a study of behavior problem children, stated that the amount of 
change from one examination to another is small, that the fluctuations 
in the ratings of problem children are little greater than those for 
normal children, and that there is less change in feeble-minded children 
than in behavior problem children who are not mentally defective. 
Garrison’s data are not very helpful for purposes of comparison, since 
only thirty per cent of his retest cases were tested after the age of 
fourteen. Poull’s study dealt with feeble-minded boys only and 
indicated a high degree of constancy of IQ. 


TABLE III.—DEVIATIONS BETWEEN INITIAL TEST AND RETEST IQ’S WITH
DIFFERENT RETEST IQ DIVISORS, GROUPED ACCORDING TO CA AT
TIME OF RETEST

                                               IQ divisor
                                          CA      14      15      16

CA 14-1 to 14-11 at retest (N = 44)
  Positive changes..................      24      26
  Negative changes..................      20      16
  No change.........................       0       2
  Mean increase.....................    6.00    8.96
  Mean decrease.....................    7.15    5.25
  Mean arithmetic change............    6.52    7.20

CA 15-0 to 15-11 at retest (N = 74)
  Positive changes..................      24      57      32
  Negative changes..................      44      14      34
  No change.........................       6       3       8
  Mean increase.....................    5.25    8.86    6.47
  Mean decrease.....................    7.55    5.43    6.21
  Mean arithmetic change............    6.19    7.85    5.65

CA 16-0 and over at retest (N = 32)
  Positive changes..................              25      19      11
  Negative changes..................               7      10      20
  No change.........................               0       3       1
  Mean increase.....................           10.92    7.21    4.90
  Mean decrease.....................            3.86    6.50    6.75
  Mean arithmetic change............            9.38    6.31    5.91

Terman²¹ and Hildreth¹³ have reported a slight tendency on the part
of the IQ to increase. Terman’s study of the intelligence of school
children revealed a tendency upward in bright, average, and dull
groups: plus 0.7, plus 3.0, and plus 1.2, respectively. Hildreth’s
data dealt with Lincoln School pupils; in no case among the results
of five tests was an IQ below eighty found, whereas the mean IQ
of the group in this study is eighty. With a group whose average
IQ was one hundred nineteen on all tests, Hildreth found a slight
tendency to positive increase.

Freeman⁸ and Terman²² have intimated that the IQ tends to fall,
the latter restricting his statement to IQ’s below sixty. Freeman,
however, found this to be the case for all groups when he reported his
study on the effect of foster home care on intelligence.

The question of the relative rate of mental growth for persons of
differing intelligence is also an unsettled one. Dearborn,⁵ Terman,²¹
Poull,¹⁸ and Freeman⁷ have found the relative rate of growth to be the
same for all groups. From his examination of school children Terman 
arrived at a growth curve which was roughly equivalent for all groups. 
Poull, in her testing of feeble-minded boys, found an average arith- 
metic change in IQ of plus 1.28 between tests, and agreed with Ter- 
man that the growth curve for feeble-minded children does not differ 
from that for unselected school children. In his Mental Tests Free- 
man presents data from various investigations and concludes that 
individuals advance at about the same relative rate. 

Other investigators have found that there is a tendency for dull 
and feeble-minded children to grow intellectually at a declining rate, 
compared to the curve for normal individuals. Henmon and Burns¹²
studied seventy-seven special class pupils and reported: 


Out of the fifty-nine cases (retested on the Stanford-Binet), thirty-one show a 
loss on the retest, the average loss being 6.3 points, while twenty-four cases show 
an average gain of 5.9, with four cases yielding an identical score. The median 
change is a loss of 1.75 IQ points and the average difference 4.8 points. The 
typical result to be expected from such a group, then, is a loss rather than a gain 
. . . It is well established that with feeble-minded subjects the IQ tends to 
decrease. This is the evident tendency with borderline cases also, and it is a very 
significant fact in making provisions for them and predictions concerning them. 








The problem of age of maturation of general intellectual ability
has direct bearing on the selection of the upper limit of IQ divisors.
Dearborn⁵ and Pintner¹⁷ have favored fourteen years, but their position
has been challenged by recent studies and by criticism of the
data from which Dearborn and Pintner arrived at their advocacy of
fourteen years. Freeman’ has criticized the Dearborn group test,
and his indication of the fallacies involved in the use of the Army
group test data is supported by Terman. Terman²¹ has suggested
sixteen as the year of maturation, but in a later article²² he intimated
that sixteen might be too high and that fifteen might be the most
suitable IQ divisor for adults. Sixteen years at least, and possibly
seventeen, eighteen, or nineteen as the limit of measurable growth
have been supported by Freeman’* and Thorndike.²³ Heinis¹¹ has
estimated the increment of intellectual growth up to the age of forty.

In interpreting our data we must add the factor of improvement 
in environment, for half of our group were foster home children. The 
influence of improved environment on intelligence is a question that is 
about as unsettled as those presented above. Freeman and others⁸
report a study of foster home children in which there was included a
group of seventy-four who were given Stanford-Binet examinations
before and after placement in foster homes. For this group the mean
initial IQ was 91.2 and the mean final IQ, after four years of foster home
care, was 93.7, a net increase of 2.5 points. Burks,³ reporting a study
of adopted children, found that improvement in environment had
to be quite marked before significant IQ changes could be effected.
Rogers, Darling, and McBride¹⁹ found that a high grade institutional
environment could not influence the IQ’s of girls from low grade homes. 

The group included in the present study is not exactly comparable 
to any of the groups so far reported, for it is a mixture of foster home 
and problem children, including some feeble-minded and borderline 
individuals. When the foster home group is considered separately, 
it is found that in thirty-one cases the IQ on retest is increased, while 
in thirty-eight there is a decrease on the retest. In seven cases there 
is no change. The average positive change is 5.4, the average negative
change 5.8, and the average arithmetic change 5.6. The net algebraic 
change is —.77. These changes are computed on a basis of an IQ 
divisor using chronological age to sixteen and sixteen for cases beyond 
that age. The change in mean IQ for the entire group, using CA to 
sixteen as the IQ divisor, is —1.71. From this it can be said that the 
foster home group showed no appreciable advantage in IQ gains over 
the non-foster home group. 
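The averages reported for the foster home group can be checked against the counts given in the same paragraph. The reported 5.6 and −.77 are reproduced when the weighted sums are divided by the sixty-nine cases that changed rather than by all seventy-six. This denominator is an inference from the figures, not something the author states, and the variable names are illustrative. A minimal sketch:

```python
# Counts and mean changes for the foster home group, taken from the text.
n_pos, mean_pos = 31, 5.4   # cases with increased retest IQ
n_neg, mean_neg = 38, 5.8   # cases with decreased retest IQ
n_zero = 7                  # cases without change

# Assumption: unchanged cases are left out of the denominator.
n_changed = n_pos + n_neg

foster_arithmetic = (n_pos * mean_pos + n_neg * mean_neg) / n_changed
foster_algebraic = (n_pos * mean_pos - n_neg * mean_neg) / n_changed

# For comparison, the same sums averaged over all cases.
n_all = n_changed + n_zero
arithmetic_all_cases = (n_pos * mean_pos + n_neg * mean_neg) / n_all
algebraic_all_cases = (n_pos * mean_pos - n_neg * mean_neg) / n_all

print(round(foster_arithmetic, 2), round(foster_algebraic, 2))
print(round(arithmetic_all_cases, 2), round(algebraic_all_cases, 2))
```

Dividing by all seventy-six cases gives values somewhat nearer zero, so the choice of denominator does not alter the conclusion that the net change is small.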


CONCLUSIONS 


1. None of the criteria of suitability employed in this study of 
one hundred fifty clinical cases tested on the Stanford-Binet before 
fourteen years and after that age (from fourteen to nineteen) justifies 
the practise of using fourteen years as the maximum IQ divisor for 
child guidance cases beyond fourteen years of age.

2. If it is assumed that mental growth proceeds at the same
relative rate for all groups and that the Stanford-Binet at the higher 
ages yields an IQ similar to that obtained before fourteen, then the 
use of an IQ divisor of CA to fifteen seems most suitable for these 
reasons: 

(a) CA to fifteen yields the smallest change in mean IQ between 
initial test and retest. 

(b) CA to fifteen yields the highest correlation between initial 
test and retest. 

(c) CA to fifteen yields the best balance in number and size of 
positive and negative deviations between tests for the group as a 
whole. 

(d) For the seventy-four cases between the ages of fifteen and 
sixteen at retest, CA to fifteen yields the most nearly equal balance of 
size and number of deviations. 

3. If it is assumed that mental growth becomes more variable in 
rate at the higher ages, with the dull group showing deceleration, and 
if it is also assumed that the Stanford-Binet tends to yield decreasing 
IQ’s for the higher age groups, then CA to sixteen appears to be the
most suitable IQ divisor for these reasons: 

(a) CA to sixteen yields deviations on retest in the expected 
(negative) direction. The change in mean IQ’s for the group as a 
whole is —1.71. 

(b) For the group as a whole CA to sixteen yields negative devia- 
tions on the retest which exceed the positive deviations both in number 
and in size. 

(c) For cases over sixteen at retest, CA to sixteen as the IQ divisor 
is the only one which yields more negative than positive deviations. 
For this group the mean negative change is greater than the mean 
positive change.

4. The influence of foster home placement on IQ in the case of 
seventy-six individuals included in this study appears to be negligible, 
the net algebraic change in IQ being —.77 when CA to sixteen is 
used as the IQ divisor. 

5. Since the weight of evidence in other studies has indicated a 
tendency for dull non-foster home groups to lose in IQ, and since for 
the present foster home group there is no strong counter-tendency, 
it appears that for clinical practise where the subjects tend to be of 
less than normal intelligence, the use of CA to sixteen as the IQ divisor 
is most suitable. 













Clinical Cases between Fourteen and Nineteen 


BIBLIOGRAPHY 


1. Blanchard and Paynter: The Problem Child. Mental Hygiene, Vol. VIII,
pp. 26-54.

2. Brown, A. W.: The Changing Intelligence Quotient in Behavior Problem
Children. Journal of Educational Psychology, Vol. XXI, pp. 341-350.

3. Burks, B. S.: The Relative Influence of Nature and Nurture Upon
Development. N.S.S.E. 27th Yearbook, 1928.

4. Report of the Conference on Individual Psychological Differences, National
Research Council, Washington, 1930.

5. Dearborn, W. F.: “Intelligence Tests.” Houghton Mifflin, 1928.

6. Freeman, F. N.: “Mental Indices.” Report of Conference on Individual
Psychological Differences, National Research Council, Washington, 1930.

7. Freeman, F. N.: “Mental Tests.” Houghton Mifflin, 1926.

8. Freeman, F. N. and others: The Influence of Environment on the Intelligence,
School Achievement, and Conduct of Foster Children. N.S.S.E. 27th
Yearbook, 1928.

9. Garrison, S. C.: Additional Retests by Means of the Stanford Revision of the
Binet-Simon Tests. Journal of Educational Psychology, Vol. XIII, pp.
307-312.

10. Goodenough, Florence: “The Values and Weaknesses of Present Scales for
the Measurement of Intelligence.” Report of Conference on Individual
Psychological Differences, National Research Council, Washington, 1930.

11. Heinis, H. A.: A Personal Constant. Journal of Educational Psychology, Vol.
XVII, p. 163.

12. Henmon, V. A. C. and H. M. Burns: The Constancy of the Intelligence
Quotient in Borderline and Problem Cases. Journal of Educational Psychology,
Vol. XIV, pp. 247-250.

13. Hildreth, Gertrude: Stanford-Binet Retests of Four Hundred Forty-one
School Children. Pedagogical Seminary, Vol. XXXIII, pp. 365-386.

14. Hirsch, N. D. M.: An Experimental Study upon Three Hundred School
Children over a Six Year Period. Genetic Psychology Monographs, Vol. VII,
1930, pp. 487-549.

15. Kuhlmann, F.: The Results of Repeated Mental Re-Examinations of Six
Hundred Thirty-nine Feebleminded over a Period of Ten Years. Journal
of Applied Psychology, Vol. V, pp. 195-224.

16. Lincoln, E. A.: The Reliability of the Stanford-Binet Scale and the Constancy
of the Intelligence Quotient. Journal of Educational Psychology, Vol.
XVIII, pp. 621-626.

17. Pintner, Rudolph: “Intelligence Testing.” Henry Holt, 1931.

18. Poull, Louise: Constancy of the Intelligence Quotient in Mental Defectives,
According to Stanford-Binet Tests. Journal of Educational Psychology,
Vol. XII, pp. 323-324.

19. Rogers, A. L., Dorothy Darling, and Katherine McBride: The Effect on the
Intelligence Quotient of Change from a Poor to a Good Environment.
N.S.S.E. 27th Yearbook, 1928.




114 The Journal of Educational Psychology 


20. Rugg, H. O. and C. Colloton: Additional Retests by Means of the Stanford 
Revision of the Binet-Simon Tests. Journal of Educational Psychology, 
Vol. XIII, pp. 307-312. 

21. Terman, L. M.: “The Intelligence of School Children.” Houghton Mifflin, 
1919. 

22. Terman, L. M.: Mental Growth and the Intelligence Quotient. Journal of 
Educational Psychology, Vol. XII, pp. 325-341, 401-407. 

23. Thorndike, E. L.: On the Improvement of Intelligence Scores from Thirteen to 
Nineteen. Journal of Educational Psychology, Vol. XVII, pp. 73-76. 






































VARIATIONS IN AFFECTIVE TONE OF DIFFERENT 
AREAS OF EDUCATIONAL PSYCHOLOGY 


GEORGE W. HARTMANN AND ANSON MARK HAMM 
Pennsylvania State College 


I. THE PROBLEM STATED 


In a previous article in this Journal¹ one of the present writers
described a technique for measuring the relative interest value of 
representative items commonly taught in elementary or introductory 
courses in general psychology.* For the sake of both confirming 
and extending the results of this earlier study it seemed desirable to 
apply a similar procedure to the field of educational psychology in 
the hope that the material organization and instructional policies 
involved in this subject of instruction would be improved thereby. 
The theoretical justification for this lies in the principle that the 
effectiveness and permanence of learning is greatly associated with the 
degree of interest exhibited by the learner. 

Every experienced teacher knows that the topics included in most 
conventional courses differ widely in the degree of interest which 
they arouse in the pupil. No person, whether teacher or student, 
finds a hundred new bits of information about physics or philology 
equally attractive. Optics and electricity may be enjoyable by one 
who is unmoved by mechanics or heat. Some facts provoke a thrill 
of delight and others secure nothing but the most perfunctory response; 
some may even be positively distasteful. An extraordinary range in 
intensity of emotional participation is found not only from datum to 
datum but from one individual experience to another with the result 
that most observers despair of detecting any consistency in such a 
hopeless tangle of subjectivity and cite with finality the old saw, 
“De gustibus non est disputandum.” Nevertheless, science cannot 
allow itself to be permanently baffled. The quantitative excursions 
made by psychologists into the realm of feeling-tone have not remained 
wholly fruitless, and seemingly naive measures of likes and dislikes 
have yielded significant practical advances in the field of vocational 
guidance (Strong), social attitudes (Thurstone), personality analysis 





* The essential features of this procedure originated in a suggestion of Dr. C. C. 
Peters, Director of Educational Research at the Pennsylvania State College. 
Similar studies have already been made by his students for the fields of biology 
and chemistry.

(Allport), and in understanding the psychical nature of “interest”
itself (Fryer). It is highly probable that the efficient selection of
subject-matter for courses may be partly conditioned by the average 
interest aroused by each detail of the material; at any rate, this was 
the feature which provided the point of departure for the investiga- 
tion now to be described. 




II. ASSEMBLY AND COMPOSITION OF THE TEST ITEMS 


A determination of the interest value of the various divisions of
educational psychology first requires that a reasonably complete and
“random” sampling of the content of its different parts be made. It
is notorious that educational psychology is a less well-defined and
strictly circumscribed region of knowledge than abnormal psychology
or school finance. While an inspection of a dozen familiar texts with
this title revealed a fair uniformity and tacit agreement concerning
appropriate problems for inclusion, the treatment and emphasis
differed considerably; hence, it was felt that for the special purpose of
this study standardized achievement tests or ordinary final examinations
with simple declarative statements provided better source-material
than the texts themselves. Since we were primarily
concerned with the strength of the average interest manifested in each
factual item, the statements chosen had all to be true or easily made
such by a simple alteration in wording. A representative list of
two hundred thirty-four items was eventually selected from the
Professional Education blank used in the Carnegie Pennsylvania
Study, the two forms of the Pottkoff-Corey educational psychology
booklet, and the Graduate Placement test employed by the School
of Education at the Pennsylvania State College. Since these items


TABLE I.—DISTRIBUTION OF TWO HUNDRED THIRTY-FOUR ITEMS IN THE
EDUCATIONAL PSYCHOLOGY INTEREST TEST ARRANGED ACCORDING TO
SECTIONAL SUB-DIVISIONS

Part                                      Number of items    Per cent

1. Native traits.......................          67            28.7
2. Laws of learning....................          64            27.4
3. Individual differences..............          61            26.7
4. Psychology of school subjects.......          42            18.1

had already been subjected to the scrutiny of previous test-makers,
the major task remaining consisted in making a proper allotment to
each of the four major sections of educational psychology which have
become distinguishable since the days of Thorndike’s classic work. 
To avoid errors in classification and needless overlapping, each item 
had to run the gantlet of three instructors of this course—if any one 
disagreed as to the correct placement, the item was eliminated. Table 
I gives the grouping and distribution of the test content which finally 
resulted. 


III. ADMINISTRATION AND SCORING OF THE “INTEREST” TEST


An eight-page booklet containing the two hundred thirty-four 
items was printed with directions requesting the respondent to rate 
each item separately with respect to its ‘‘pleasure’’ value or the 
amount of satisfaction felt in recognizing the fact presented. A scale 
placed beside each item permitted the respondent to indicate the 
reaction as follows: 


Pleasure.
    ......................................................    3
    ......................................................    2
    ......................................................    1
    ......................................................    0
Aversion.
    ......................................................   -1
    ......................................................   -2
    ......................................................   -3


Responses were secured from more than three hundred students 
in five different colleges of Pennsylvania with seven classes in educa- 
tional psychology involved, thus ensuring a reasonably generous 
sampling of teaching conditions. Most of the subjects were sopho- 
mores or juniors taking their first course in educational psychology. 
The test was given in a regular class period during the final week of 
the semester so that fair contact with this type of material had already 
been obtained. 

The next step was the computation of the mean pleasure value 
of each item, which was found by dividing the algebraic sum of the 
checked values placed opposite each statement by the number of 
individuals responding to it. For example, the highest average 
“interest” is attached to the item, “Other things being equal, the
stronger the motive for learning, the faster will be the learning process”
with an index of 2.04; the lowest hedonic index belonged to the statement,
“About 2 per cent of the child population of a community is too
defective mentally to profit from instruction in the public schools”
with an index value of −.29, showing a slight aversion tendency.
It is consoling to find that only three of the statements possessed
negative pleasure value! Since the composite mean pleasure value
for two hundred ninety-seven papers was .94 ± .012 (the mean composite
score for the odd-numbered statements was .92 and for the
even .96, with an SD of the distribution of .29) it is evident that
the average fact encountered in educational psychology possesses
some slight but positive pleasure value. In the Appendix will be
found a complete list of the statements arranged in descending order
according to the size of the corresponding pleasure index.
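The computation of the mean pleasure value described above, the algebraic sum of the checked scale values divided by the number of individuals responding, may be sketched as follows. The ratings shown are hypothetical and are not data from the study; the use of `None` for an omitted rating and the function name are assumptions made for illustration.

```python
def mean_pleasure_value(ratings):
    """Algebraic sum of checked scale values (-3 to 3) divided by the
    number of respondents who rated the item.

    None marks a respondent who left the item unrated (an assumed
    convention, not one described in the text).
    """
    checked = [r for r in ratings if r is not None]
    if not checked:
        raise ValueError("no respondent rated this item")
    return sum(checked) / len(checked)


# Hypothetical ratings of one item by eight respondents, one omission.
hypothetical_item = [3, 2, 2, None, 1, 0, -1, 3]
example_index = mean_pleasure_value(hypothetical_item)
print(round(example_index, 2))
```

Because negative ratings subtract from the sum, an item can have a negative index only when aversion outweighs pleasure across the respondents.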


IV. COMPARISON OF DIFFERENT SECTIONS OF EDUCATIONAL 
PSYCHOLOGY WITH RESPECT TO THEIR PLEASURE INDICES 


Before presenting the major findings of this investigation it will be 
desirable to examine the reliability of the measure we have employed. 
The self-correlation of the interest test was obtained by taking every 
third paper from the entire pack of two hundred ninety-seven, and 
matching the grand mean pleasure value of the one hundred seventeen 
odd items for each respondent with the corresponding mean pleasure 
value for the even items. This yielded a coefficient of .95 ± .006.
In addition, the consistency of the instrument was determined by 
correlating the mean pleasure value of each item in a batch of eighty 
blanks selected at random from the first half of the pack with the 
analogous index for the corresponding statement in another group of 
eighty papers similarly chosen from the second half of the pack. 
This yielded a corrected Pearson r of .91 ± .007. Both values are
high enough to warrant confidence in the stability of the data and 
serve to remove the objection that peculiar group reactions may be 
involved. 
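The odd-even self-correlation described above can be illustrated with a short sketch. The papers below are hypothetical four-item blanks, not data from the study, and the function names are illustrative. No correction formula is applied here, since the text does not say which correction was used for the second coefficient.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def odd_even_correlation(papers):
    """Correlate each respondent's mean rating on the odd-numbered
    items with the mean rating on the even-numbered items."""
    odd_means = [sum(p[0::2]) / len(p[0::2]) for p in papers]
    even_means = [sum(p[1::2]) / len(p[1::2]) for p in papers]
    return pearson_r(odd_means, even_means)


# Hypothetical ratings: one row per respondent, one column per item.
hypothetical_papers = [
    [3, 2, 3, 3],
    [1, 1, 0, 1],
    [2, 2, 1, 2],
    [-1, 0, 0, -1],
]
split_half_r = odd_even_correlation(hypothetical_papers)
print(round(split_half_r, 3))
```

A coefficient near unity indicates that respondents who rate the odd items highly also rate the even items highly, which is the sense in which the instrument is called self-consistent.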

The average pleasure values for all the statements were grouped 
into the four categories mentioned in Table I above and the means 
of these means computed as shown in Table II. The remaining statis- 
tical calculations implied in the table will be clear to the trained 
reader since they all follow conventional research practice.* From 





* Through an oversight, the correlations between the different divisions of the 
blank were not computed. Since these are probably all very high, the inclusion 
of the correlation terms in the standard errors would have raised the values in the 
last column of Table II to approximately one hundred out of one hundred. Con- 
sequently, the ‘‘chances”’ as they now stand represent an understatement of the 
case.

these figures we may generalize that material dealing with the “laws
of learning” makes the greatest appeal to the ordinary student,
followed by the “psychology of the school subjects,” “native traits,”
and the field of “individual differences.” The fact that this is the
hierarchy prevailing at present does not mean that the relative posi- 
tion will remain fixed. New social conditions, a changed educational 
philosophy, and internal transformations of the subject itself may 
produce either a re-distribution of the affective tone of the various 
items and sections or a general rise and fall in the mean hedonic level 
of the entire field. But before we can concern ourselves about the 
control of future shifts, we must know the contemporary status from 
which these shifts are to be made. 


TABLE II.—HEDONIC RANK ORDER OF FACTUAL STATEMENTS IN EDUCATIONAL
PSYCHOLOGY WHEN COMBINED AND CLASSIFIED ACCORDING TO ITS FOUR
MAIN DIVISIONS (N = 297)

                                    Mean                     Diff.                     Chances out of one
Group                             pleasure    SD   SD_av.   between      D/SD_diff.   hundred in favor of
                                   index                     means                    obtained difference

1. Laws of learning                 1.07     .39    .049   (1)-(2) = .10     1.33              90
2. Psychology of school subjects     .97     .36    .055   (1)-(3) = .19     2.61              99
3. Native traits                     .88     .44    .054   (1)-(4) = .24     3.20             100
4. Individual differences            .83     .43    .055   (2)-(3) = .09     1.17              87
                                                           (2)-(4) = .14     1.79              96
                                                           (3)-(4) = .05      .65              74






V. THE RELATION OF AN INDIVIDUAL’S COMPOSITE PLEASURE SCORE 
TO INTELLIGENCE, SEX, LENGTH OF ITEMS, AND VOCATIONAL 
INTEREST IN TEACHING AND PSYCHOLOGY 


In order to secure some light on the factors responsible for the 
variations in mean pleasure value from person to person, a more 
comprehensive analysis was made of a group of fifty-five students at 
one institution. These people were also given the Strong Vocational 
Interest Blank during the week in which the main test responses were 
secured, the blank later being scored for interest in “teacher” and
“psychologist.” The other necessary information was secured independently
through the personnel records of the college. Table III
reveals the intercorrelations thus obtained. 


TABLE III.—RELATION OF THE COMPOSITE PLEASURE INDEX TO VARIOUS FACTORS
PRESUMED TO INFLUENCE IT

VARIABLES                                                 CORRELATION
Pleasure index and intelligence (DeCamp)...........       .154 ± .088
Pleasure index and age.............................       .002 ± .090
Pleasure index and Strong teaching score...........       .310 ± .082
Pleasure index and Strong psychologist score.......       .241 ± .083
Strong teaching score and intelligence.............       .026 ± .090
Strong teaching score and age......................       .005 ± .090


Evidently the agreeableness of information presented in educa- 
tional psychology has little or no connection with a person’s native 
ability, a finding in harmony with the results of the senior author’s 
earlier study. The modest relation between item pleasure and the 
two occupations with which one would anticipate the closest kinship 
is somewhat disappointing, but enough to encourage further search in 
this direction. 

The influence of sex upon the hedonic score was tested by com- 
paring the average pleasure value for fifty-five men and fifty-five 
women. The male mean was .953 with a SD of .573 and a SD_av.
of .077; the female mean was .969 with a SD of .565 and a SD_av. of .076.
Clearly the sex element is a negligible factor, for the critical ratio 
of .15 which results gives only about fifty-six chances in one hundred 
that the index of the women is higher than that of the men. 
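The critical ratio and the "chances in one hundred" quoted above appear to follow the conventional procedure: the difference between two means is divided by the standard error of the difference, and the ratio is referred to the normal curve. The sketch below assumes uncorrelated means and a normal distribution; neither assumption is stated explicitly in the text, and the function names are illustrative.

```python
from math import erf, sqrt


def critical_ratio(mean_a, se_a, mean_b, se_b):
    """Difference between two means divided by the standard error of
    the difference, assuming the two means are uncorrelated."""
    return abs(mean_a - mean_b) / sqrt(se_a ** 2 + se_b ** 2)


def chances_in_100(ratio):
    """Area of the normal curve below the critical ratio, as a
    number of chances out of one hundred."""
    return 100 * 0.5 * (1 + erf(ratio / sqrt(2)))


# Sex comparison, using the means and SD_av. values from the text.
sex_ratio = critical_ratio(0.969, 0.076, 0.953, 0.077)
sex_chances = chances_in_100(sex_ratio)
print(round(sex_ratio, 2), round(sex_chances))
```

Under these assumptions the sex comparison comes out close to the critical ratio of .15 and the fifty-six chances in one hundred given in the text.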

The effect of such a purely mechanical or external factor as the 
length of the statement was measured by computing the composite 
pleasure value of items with less than the average number of words in 
the sentences and contrasting this with the value for sentences with 
more than the mean number. On a priori grounds one may assume 
that tersely expressed ideas are more readily grasped and consequently 
more highly esteemed. The mean length of all the test items being 
13.48, a comparison was made of one hundred twenty-nine statements 
with thirteen words and under versus one hundred five of those with 
fourteen words or more. The index for the first set was .961 with a SD
of .222 and a SD_av. of .019, while the index for the second group was
.938 with a SD of .456 and a SD_av. of .044. The critical ratio yields
sixty-eight chances out of one hundred that a true difference is present, 
and while this is far from adequate, it suggests that brevity and (pre- 
sumably) simplicity of statement are contributing elements which 
heighten the affective tone of detailed information.


VI. SUPPLEMENTARY ANALYSIS OF THE CAUSES RESPONSIBLE
FOR THE PREFERENTIAL AFFECTIVE STATUS OF CERTAIN CLASSES 
OF INFORMATION 


The search for some general influences which might be operative 
in determining the ratings of the subjects led the writers to present 
thirty-six experienced elementary and secondary school teachers with 
the following task: The ten items highest in mean pleasure value (I), 
the ten lowest (III), and ten around the median index position
(II), were reproduced on three separate sheets and given to the 
teachers individually with these instructions: 


Below are three classes of statements generally found in courses in educational 
psychology. These items probably do not have the same interest for you. With 
your help it may be possible to determine why these variations in preference exist. 
Please read each of the three groups and then answer the following questions: 

1. Which of the three groups of statements appeals to you most? 

2. Do you make more use of the facts in the group you prefer than in the others? 

3. Are some facts unpleasant because they seem to clash with your ideals or 
prejudices and does this affect your interest in them? 


4. What phase of educational psychology interests you most? Underline
one of these four: Laws of learning, psychology of school subjects, individual
differences, native traits.


5. Are statements you understand readily more interesting than those you do 
not immediately comprehend but can determine by a little concentrated analysis? 


6. Which group are you able to apply in your daily work more successfully 
than others? (I, II, III.) 


The responses of these teachers may be most conveniently clas- 
sified by adhering in the paragraphs below to the order of the inquiries 
just listed. 

1. Group I (with the ten highest items in pleasure value) was 
preferred by twenty-nine, Group II (the middle range) by five, and 
Group III (the ten lowest) by two out of a total of thirty-six judges. 
This suggests that teachers in the field react very much in accordance 
with the preferences of teachers in training. 

2. Thirty-two individuals employed most frequently the facts 
found in the group which they ranked first with respect to liking. 
Paradoxical as it may seem in view of our common belief about the 
relation of repetition to monotony, frequency as determined by the 
number of occasions a fact is encountered in actual life is highly 
associated with the pleasure experienced by the recognition thereof. 

3. Nineteen teachers claimed that the unpleasant implications
of facts did not affect their interest in them, while seventeen maintained
that they did. This response is difficult to interpret since it is
impossible to determine the degree of rationalization involved. How- 
ever, a wise instructor will always pay heed to this factor in tracing 
the reception which his exposition receives. Certainly, the pleasure 
value of much biology, anthropology, and psychology—to mention 
but a few disciplines—is spoiled for a boy who has to integrate the 
facts there met with his earlier-established mores. 

4. Twenty-four teachers preferred the label “individual differences”;
six selected the “laws of learning”; five chose “native traits”; and
only one expressed a liking for “psychology of school subjects.”
It is strange that the caption “individual differences” should rank
first in “interest” when the items which experts would normally
classify under this heading stand lowest in average pleasure value
(see Table II)! Perhaps this is attributable to poor understanding
of the actual content of these divisions or to mere irrational
love of a name, like Southey’s desire to found a Pantisocracy in
America because it contained a river with the beautiful appellation
“Susquehanna.”

5. Twenty-two members held that easily interpreted statements 
were more interesting to them; fourteen found those requiring some 
effort more attractive. How many of the minority were guilty of 
self-deception is uncertain, but it seems clear that a preference for 
readily comprehensible items is dominant. 

6. Thirty-two individuals found the items in Group I more appli- 
cable to the teacher’s work than any of the others. This is a finding 
of both theoretical and practical significance, and perfectly consistent 
with the record of paragraphs 1 and 2 above, to which it may serve as a 
sort of confirmation. 


VII. CONCLUSIONS 


For at least a generation professional pedagogical circles have been 
familiar with the difference between the logical and the psychological 
approach to the problems of teaching method, and yet it is amazing 
how little educational psychologists—who have been primarily 
responsible for propagating the distinction—have done to clarify the 
position with which they have been identified. Some unfriendly 
critics, observing what this antithesis has led to in the chaotic pro- 
cedures of many classrooms, have felt that the psychological must be 
the illogical! Of course it is nothing of the kind. Perhaps the real 
difference would be better expressed by comparing the effects upon
practice of a functional logic versus a structural logic. 

The technique and findings of this study should serve in some 
small way to illuminate this situation. Example is always more 
powerful than precept, and whatever good reasons (from the stand- 
point of a finished system of knowledge) may exist for organizing the 
data of educational psychology in the sequence: Original nature, 
learning phenomena, individual differences, etc., pedagogically (from 
the standpoint of enlisting the maximum energies of the learner) this 
order may be considerably improved by altering it to conform to the
preferential arrangement of Table II. It is not to be understood that 
this order is necessarily the best way in which to present the facts of 
educational psychology—that will have to be tested by an experi- 
mental comparison of the effectiveness of different types of sequence. 
It is an open question whether the best permanent results are secured 
by placing the emotionally most satisfying section of the course first: 
perhaps a central or terminal position would be superior. But 
before this problem can be raised one must know what the most agreeable
divisions are, which is just the issue we have tried to investigate.

Tolansky² has recently suggested that a reduction in manifest IQ
or an awkward inconsistency in test performance may be due to 
emotionally-toned associations of the sort identified with the researches 
of Jung and Freud. It is almost certain that achievement test items 
are subject to similar distortions. Teachers and test-makers need to 
recognize that associative feeling-tone can act upon learning and 
examining in at least two ways: (1) It can reduce the number of
questions done; (2) it can produce wrong answers by disturbing the 
emotions. Both from the standpoint of human happiness and 
sheer academic advancement the positive feeling-tone of knowledge 
must outweigh any negative components if a favorable attitude 
toward scholarship is to result. The concept of interest—a much 
abused word these days!—undoubtedly includes more than the affec- 
tivity with which this report has been mainly concerned, but that this 
element is a vital factor in determining the success of the purest 
intellectualistic curriculum cannot be gainsaid. 

To what extent the measure of enjoyment which students derive 
from various subject-matter items should influence the program of 
course work is a matter which awaits further research. Should the 
instructor in educational psychology, e.g., dwell long and ardently 
upon the “laws of learning” because his pupils like this part better 
than the others, or should he do just the opposite and lavish more
effort upon “individual differences” because this field suffers from an
initial affective handicap? Obviously no ready-made answer is 
available to this question, for curriculum-specialists know that the 
amount of satisfaction the learner derives from an item is only one 
of the criteria for justifying the inclusion of specific material in any 
school subject. But in conjunction with other considerations (other 
than frequency of use with which it appears to overlap greatly) the 
test of specific feeling-tone should serve to ensure a broad “utilitarian”
and quasi-scientific basis for the selection of topics of instruction. 

APPENDIX


The two hundred thirty-four representative items in educational psychology are
arranged below in descending order of their average pleasure index.
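
The relation between the two numerical columns of the table can be illustrated with a short sketch. The mean of .94 and the SD of .42 used here are not stated in this appendix; they are assumptions inferred from tabled pairs such as 1.36 with +1.00 and .52 with -1.00, and a few rows as printed may not fit them exactly. The function name is hypothetical.

```python
def z_score(pleasure_value, mean=0.94, sd=0.42):
    """Standard score of a pleasure value under the assumed mean and SD."""
    # Assumption: mean and sd are inferred from the table, not given by the writers.
    return (pleasure_value - mean) / sd


# Three pleasure values taken from the table, printed with their standard scores.
for value in (2.04, 1.36, 0.52):
    print(value, f"{z_score(value):+.2f}")
```

Under these assumed constants the three values shown reproduce the tabled Z scores of +2.62, +1.00, and -1.00.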





Pleasure   Z scores   Item
value
2.04 +2.62 | Other things being equal, the stronger the motive for learning, 
the faster will be the learning process. 

1.98 +2.48 | The best advice to parents of a fourteen-year-old girl who 
wants to go out at night to meet boys is to invite boys and 
girls frequently to the home. 

1.82 +2.10 | Distributed practice periods usually yield better returns for
the time spent in learning than concentrated practice.

1.80 +2.05 | A child just learning to talk understands more words than 
he can use in talking. 

1.78 +2.00 | Learning with intent to remember aids in recall.

1.77 +1.98 | Exercise of any reaction tends to make that reaction more 
prompt, certain and easy. 

1.67 +1.74 | Failure to learn school assignments is not always due to lack 
of learning capacity. 

1.62 +1.62 | Increase in rate of reading does not tend to weaken compre- 
hension. 

1.61 +1.60 | The same standard of performance should not be held for all 
pupils taking the same subject. 

1.60 +1.57 | The development of a child should determine the sequence of 
the materials of instruction. 

1.58 +1.52 | The purpose of educational psychology is to provide a scien- 
tific basis for the teaching art. 

1.574 +1.51 | In mature oral reading, the eyes precede the voice.

1.57 +1.50 | Mental tests are not valid or reliable enough to be made the 
sole basis of educational guidance. 

1.54 +1.43 | The best pupils in a grade will do approximately twice as 
many problems in arithmetic as the poorest will do, and 
usually with a greater degree of accuracy. 

1.53 +1.40 | The difference between work and play is due to the mental 
attitude. 

1.53 +1.40 | A measure of intelligence cannot be based upon the quality of
handwriting.

1.52 +1.38 | Making new information personal is a desirable method of 
arousing and maintaining interest. 

1.51 +1.36 | It is well for a teacher to realize that the backgrounds of perceptual
experience of first grade pupils are very dissimilar.

1.51 +1.36 | Learning is a process of active participation. 

1.50 +1.33 | Interest and ability are not the only factors which should be 
considered in educational guidance. 

1.50 +1.33 | Children may inherit from their parents superior possibilities 
of learning school subjects. 

1.49 +1.31 | The Dalton plan conceives of the classroom as a laboratory. 

1.48 +1.29 | Knowledge of success or failure in learning is generally a spur 
to achievement. 

1.46 +1.24 | General intelligence has been defined as the native capacity
to learn. 

1.45 +1.21 | The shy child should not be left alone to find his own emo- 
tional outlets. 

1.45 +1.21 | Stimuli which appeal through several sense avenues are the 
most effective in insuring response. 

1.44 +1.19 | Skill is dependent upon habit. 

1.41 +1.12 | The laws of learning hold for both the genius and the normal 
person. 

1.41 +1.12 | The MA is generally a better basis upon which to organize 
public school classes than is the CA. 

1.40 +1.10 | Every normal group has as many very superior as very 
inferior intelligences. 

1.40 +1.10 | In general, there is a positive correlation between intelligence 
scores and academic success. 

1.39 +1.07 | Learning is a native capacity. 

1.38 +1.05 | Attention may be habitual as well as instinctive. 

1.37 +1.02 | The Binet test shows that girls are as bright as boys. 

1.37 +1.02 | Rhythm in handwriting depends upon a coordination of 
movement. 

1.36 +1.00 | The intelligence quotient is found by dividing the mental age 
by the chronological age. 

1.35 + .98 | The principle of conditioned response is clearly shown in 
feeling a thrill when your college song is sung. 

1.34 + .95 | It is easier to form a new habit than to change an old one. 

1.32 + .90 | A comparison of the mental traits of one twin with those of 
another shows that twins resemble each other significantly 
more closely than do ordinary brothers and sisters. 

1.31 + .88 | Acquired characteristics are not transmissible biologically. 

1.31 + .88 | Overlearning slows the rate of forgetting. 

1.30 + .86 | One of the marks of high intelligence is the ability to make 
many responses to a given stimulus. 

1.30 + .86 | School tasks should not be employed as punishments. 

1.30 + .86 | A conditioned reflex is a learned response. 

1.30 + .86 | When a pupil can add only by making use of his fingers, he is 
in need of drill to render the responses automatic. 

1.29 + .83 | Relative brightness of letters and background is a conspicuous 
factor in the legibility of the printed page. 

1.29 + .83 | The changes that take place in a person’s behavior sometimes 
constitute learning. 

1.28 + .81 | By validity is meant the degree to which a test measures that 
which it purports to measure. 

1.27 + .79 | The average IQ of high school pupils is above the average for 
the population in general. 

1.27 + .79 | The amount of transfer of training which may take place 
between two situations may be augmented by appropriate 
methods of teaching. 

1.26 + .76 | Character is the sum total of our habits. 

1.26 + .76 | If a test measures that which it purports to measure it is said 
to be valid. 

1.25 + .74 | The spelling period should not, as a rule, be more than fifteen 
minutes in length. 

1.23 + .69 | The habit clinic attempts to supervise the mental hygiene of 
the pre-school child. 

1.22 + .67 | The variability in reading ability of individuals increases with
age.

1.22 + .67 | A child can learn to speak any language with equal facility.

1.21 + .64 | Children have an instinctive tendency to struggle if held 
tightly. 

1.21 + .64 | Imagination in children aged three to seven is predominantly 
dramatic. 

1.21 + .64 | Frequent objective and comparable measurements of achieve- 
ment form the safest basis for constructive supervision of 
teaching. 

1.20 + .62 | The range is the difference between the highest and the lowest 
score. 

1.20 + .62 | Not all instinctive traits are desirable under present day 
conditions. 

1.19 + .60 | The focus of educational effort has recently shifted from the 
adolescent period to the pre-school stage. 

1.19 + .60 | An extrovert is usually characterized by an interest in social
affairs.

1.18 + .57 | Latin cannot be justified on its disciplinarian value alone.

1.17 + .55 | Pre-adolescents are less self-conscious and sensitive than
adolescents.

1.17 + .55 | The curve of satisfyingness generally coincides with the curve
of achievement.

1.16 + .52 | In general, learners are educated by their own mental
responses rather than by the influences provided by the
teachers.

Words are generally scaled in difficulty according to the per- 
centage of pupils who can spell them correctly. 

Ideas will not become effective unless they have led to actions 
with satisfying results. 

Testing should be a part of the stated administrative routine, 
and should not be left entirely to the initiative of the indi- 
vidual teachers. 

What we learn is a reaction. 

Very bright children are as strong physically as children of 
average mental ability. 

The growth curve of intelligence from birth to the age of 
twenty is most accurately represented by a curved line. 

Boys mature physiologically a little later on the average than 
do girls. 

The beginning and end of adolescence are hard to determine. 

The instructional procedures in most high schools are adapted 
to the average students. 

Mendel’s law is a generalization based on observing the 
results of cross breeding. 

The modifiability of our nervous system determines our 
capacity to learn. 

Heterogeneous grouping is undesirable because it works an 
injustice to the bright pupils. 

Instincts need to be redirected in order to adapt them to social 
demands. 

Day dreaming is most common during adolescence. 

Rationalization is a form of self-justification. 

Biological heredity is a large factor in determining mental
capacity.

Improvement in addition will alter one’s ability in multiplica- 
tion, because the eye movements and certain other processes 
are in part common to the two functions. 

Building up meanings for sensations which one receives always 
involves learning. 

Habits can be built only when there is some form of reaction. 

The more habits we have the more free we are to attend to 
learning new things. 

In the senior high school years, grade norms and norms based 
on equal learning opportunity, are more meaningful for 
educational guidance than chronological age norms. 

1.07 + .31 | The brighter the student the less he needs to study in order to
reach a given standard of academic success.

1.06 + .29 | A baby instinctively shows fear at removal of support.

1.06 + .29 | The extent to which a pupil has mastered the subjects in his
school curriculum is most directly indicated by his EA.

1.06 + .29 | Ideas of right and wrong are dependent on social environment.

1.06 + .29 | Self-assertiveness is sometimes a compensation for shyness.

1.05 + .26 | Suppression of articulatory movements aids in the speed or
accuracy of column addition.

1.04 + .24 | The pubertal development of the two sexes occurs at different
times.

1.04 + .24 | If a reaction occurs when there has been no opportunity for
learning it, the reaction is instinctive.

1.04 + .24 | Two groups may have the same means but differ in variability.

1.04 + .24 | The natural or direct method of teaching foreign language
means that all instruction and responses are in the foreign
tongue.

1.03 + .21 | A control group is necessary to every sound study of transfer.

1.03 + .21 | The curve of mental and physical growth is rapid at first and
then slows down.

1.03 + .21 | Variation between individuals in improvement is most likely
due to original nature.

1.03 + .21 | The word “thinking” is popularly used to reflect memory,
imagination, and problem-solving.

1.02 + .19 | When reading is taught by the word method rather than by
the letter method, instruction in spelling should be begun
after reading.

1.02 + .19 | When a practice curve increases very slowly at first and then
rises sharply, it is said to be positively accelerated.

1.01 + .17 | Ordinarily, it is a bad policy to form a bond which will later
have to be broken.

1.01 + .17 | A mature writer can form letters correctly when he is blindfolded
by making use of kinaesthetic control.

1.00 + .14 | The part method of memorizing is seldom the equal of the
whole method.

1.00 + .14 | The quality of school work is no index of intelligence unless
age is taken into account.

1.00 + .14 | The rate and comprehension of oral reading is less than that of
silent reading.

.99 + .12 | When a mature reader reads increasingly difficult material
the number of fixations of the eyes increases, other things
being equal.

.98 + .10 | Ability to read has appreciable influence upon mathematical 
progress. 

.98 + .10 | Limited sampling of a student’s knowledge of subject-matter 
is a major defect of the essay type of examination. 

.98 + .10 | Women are superior to men in memory tests. 

.98 + .10 | An ideal may be described psychologically as consisting of a 
generalized idea with an emotional connection which con- 
trols one’s conduct. 

.98 + .10 | The probability curve applies equally well to both mental 
and physical traits. 

.97 + .07 | The good thinker is one who formulates many hypotheses, 
including many wrong ones. 

.97 + .07 | Overemphasis upon language in teaching is dangerous because 
words without adequate perceptual experiences are usually 
meaningless. 

.97 + .07 | Real modification of the behavior of an adult takes place 
sometimes without awareness of the change. 

.97 + .07 | Most of the child’s fears are learned. 

.97 + .07 | A person of thirty learns as rapidly as a lad of fourteen. 

.97 + .07 | All children will not be able to make the same grade if they 
put forth the same effort. 

.96 + .07 | Practice may make imperfect as well as perfect.

.95 + .02 | Habits are decidedly specific. 

.95 + .02 | Ebbinghaus was the first systematic user of nonsense syllables 
in memory studies. 

.93 — .02 | The recognition-span in oral reading is smaller than in silent 
reading. 

.93 — .02 | When the MA is three and the CA is five years, the IQ is 
sixty. 

.92 — .05 | Our ordinary descriptions of individual differences are either
quantitative or qualitative. 

.91 — .07 | Imagery tends to emerge at points where our thinking is 
baffled. 

.91 — .07 | The Binet test is an individual test.

.91 — .07 | Facial expression is no index of intelligence.

.91 — .07 | The coefficient of correlation is a measure of relationship
between two sets of scores.

.91 — .07 | The theory of concomitant development stresses the continuous
and gradual growth of mental processes.

.90 — .10 | The mental age of a child as determined by the Stanford- 
Binet test correlates positively with his ability in reading. 

.89 — .12 | The adolescent is usually sure that his inner experiences are 

Variations in Affective Tone of Different Areas

.89 — .12 | In a given statistical distribution the class intervals should be
equal. 

.89 — .12 | Professional parents produce an undue proportion of gifted 
children. 

.89 — .12 | Rationalization is a term applied to make facts support one’s 
wishes. 

.88 — .14 | When we attach meaning to a sensation we have a percept. 

.88 — .14 | Equal practice tends to increase differences in achievement. 

.88 — .14 | There is no particular tendency for great men to spring from
humble circumstances. 

.88 — .14 | Children who develop early physiologically as a rule do as 
well in school as children who develop late. 

.88 — .14 | The coefficients of correlation between mental abilities when 
any large group of people is tested are generally positive. 

.88 — .14 | Organic changes under emotional stress are generally adapta- 
tions to increase the motor efficiency of the bodily 
mechanism. 

.87 — .17 | Superior handwriting results in a school system cannot be
obtained simply by devoting more time to teaching it.

.87 — .17 | For most statistical purposes the mean is a better measure
of central tendency than the median.

.86 — .19 | The introvert tends to solve his conflicts by day dreaming.

.85 — .21 | Practice exercises should be corrected by the pupils who have
taken them.

.83 — .26 | Efficient thinking does not depend upon imagery.

.83 — .26 | According to a prevalent theory of physiological psychology,
learning is always accompanied by a change in synaptic
resistance.

.83 — .26 | The habit of repeating questions often leads to habitual
inattention.

.83 — .26 | In the past, a larger proportion of classical students have won
scholastic honors than any other group.

.83 — .26 | Varying the concomitants is a helpful device in learning to
generalize.

.83 — .26 | In a normal distribution, the mode and the mean coincide.

.82 — .29 | Prognostic tests are most useful in predicting future achieve-
ment.

.81 — .31 | In motor learning, consciousness should be centered on the
objective result of the movement, not on the movement
itself.

.81 — .31 | Occupational intelligence limits have been partially
determined. 

.80 — .33 | Judd believes that transfer is contingent upon the mode of 
instruction, and is not a function of the subject as such. 

.79 — .36 | Large divergencies from the norm occur less often than small 
ones. 

.78 — .38 | The logical order of arrangement means an arrangement of 
materials showing what place each fact dealt with has in the 
comprehensive system of the completed science. 

.78 — .38 | One of the advantages of the whole method of memorizing is 
that all the needed associations are approximately equal in 
strength. 

.77 — .40 | Ideas sometimes cause us to disbelieve the evidence of our 
senses. 

.77 — .40 | Mental tests measure intellectual capacity indirectly. 

.76 — .43 | An extensive study of the similarities of twins in special 
mental traits was made by Thorndike. 

.76 — .43 | A plateau in the learning curve indicates absence of improve- 
ment in the units measured. 

.76 — .43 | Unconditioned responses are responses that are normal to 
their respective stimuli. 

.75 — .45 | There is normally much variability in physiological age. 

.74 — .48 | Using the AR as a basis for school marks tends to make marks 
depend very largely upon effort or studiousness. 

.74 — .48 | A strong emotion tends to be accompanied by decreased
activity of the digestive system. 

.74 — .48 | A gifted man may have a dull son. 

.73 — .50 | Children who work most rapidly are also most accurate. 

.73 — .50 | More time may profitably be given to reciting than to the 
reading of the content to be mastered. 

.72 — .52 | A child’s idea of right is that which “works.”

.72 — .52 | A neural connection is weakened when the response is
unpleasant. 

.70 — .57 | An attempt to make a left-handed child right-handed may 
upset the ability to form habits and the speech center. 

.69 — .60 | Where marks are based on a percentage system, (seventy to 
one hundred per cent) the distribution curve tends to reveal 
modal points at regular five-unit intervals. 

.68 — .62 | An interval of time is perceived to be longer than it really is 
when it is filled with events which are tedious and 
monotonous. 

.68 — .62 | The time of teething and the amount of cartilage in the carpal 
bones are used to measure physiological maturity. 

.68 — .62 | A healthy child may be intellectually dull. 

.68 — .62 | Emotional excitement does not increase efficiency in learning. 

.67 — .64 | The youngest children in a class tend to be the brightest in
the class. 

.66 — .67 | The AR is usually computed from the results of both achieve-
ment and intelligence tests.

.66 — .67 | According to the best available evidence, individual differ-
ences are caused by the operation of a large number of
relatively independent factors of nearly equal weight.

.65 — .69 | Adolescents of less than one hundred IQ will not be likely to 
achieve satisfying success in college. 

.65 — .69 | When any habit becomes fixed, it is annoying to disturb it. 

.65 — .69 | Negative correlations between desirable traits are rare. 

.64 — .71 | Intelligence tests assume that the people tested have had 
equal opportunities to learn. 

.64 — .71 | The term “apperception’”’ was introduced into educational 
discussion by Herbart. 

.63 — .74 | Monroe devised a silent reading test. 

.62 — .76 | People who have the same IQ’s have the same degree of 
brightness. 

.61 — .79 | The doctrine of formal discipline grew out of the older
“Faculty psychology.”

.61 — .79 | The will appears free in those instances where intra-organic 
stimuli are more effective than external ones in dominating 
behavior. 

.60 — .81 | A pupil who works to the maximum of his capacity cannot
have an accomplishment quotient of less than one hundred. 

.58 — .86 | The average intercorrelations among mental traits in siblings 
are about the same as those for physical traits. 

.55 — .93 | The intelligence of average pupils in a grade is frequently
over-estimated.

.54 — .95 | The copious use of mnemonic devices does not strengthen the
memory. 

.53 — .98 | Strict repression encourages lying and deceit. 

.52 —1.00 | Control by fear is apt to develop dishonesty. 

.52 —1.00 | Kittens catch mice by instinct. 

.51 —1.02 | Franzen suggests use of the AQ as a school mark.

.51 —1.02 | The chief difference between the reflexes and instincts is in
degree of complexity. 

.49 —1.07 | EA/MA = EQ = AQ. 

.49 —1.07 | Stuttering is usually associated with emotional disturbances. 

.48 —1.10 | Competition is not the best form of motivation and should be 
used sparingly. 

.44 —1.19 | A child’s attitude toward cheating is primarily the result of
his group mores.

.44 —1.19 | Initial performance is highly symptomatic of final achieve-
ment.








The Journal of Educational Psychology

.43 —1.21 | As a rule a slow learner retains less than a rapid learner.

.42 —1.24 | Children of very superior intelligence are likely to be mis-
understood in school.

.41 —1.26 | The roots of most functional mental disorders have been
traced back into the period of childhood. 

.38 —1.33 | The rank-difference formula for correlation requires that the 
differences in the rankings of the two variables be squared. 

.37 —1.36 | The ideo-motor theory states that the presence of any mental 
content inevitably leads toward the appropriate muscular 
response. 

.36 —1.38 | Most criminals begin their anti-social career during 
adolescence. 

.35 —1.40 | We are controlled more by our emotions than by our intellect.

.34 —1.43 | Mental and physical growth run parallel.

.34 —1.43 | The school is frequently a great offender in making children 
try to do things beyond their capacities. 

.30 —1.52 | The materials which one reads are perceived only while the 
eyes are stationary. 

.30 —1.52 | Studying other languages generally aids little in the mastery 
of one’s own. 

.29 —1.55 | The favoring of foveal to peripheral vision illustrates original 
attentiveness. 

.28 —1.57 | Insisting upon neat writing and arrangement in arithmetic 
papers will not tend to make pupils write better-looking 
spelling and geography assignments. 

.27 —1.60 | Dull pupils become encouraged by being placed in a class 
where all other pupils are dull. 

.27 —1.60 | The repetition of a grade or course by “‘flunkers’’ results in 
little improvement of their accomplishment. 

.21 —1.74 | The conditioned response is relatively temporary and
unstable. 

.18 —1.81 | The educational accomplishment of Jewish children is higher 
than that of most any other racial group. 

.17 —1.83 | The tendency to be afraid of a loud noise is due to instinct. 

.162 —1.85 | Other things being equal, the reliability of a test is increased
by increasing its length. 

.16 —1.86 | There is little doubt that at least thirty per cent of school 
children are handicapped by defective vision. 

.15 —1.88 | The original nature of the Neolithic man was not very different
from our own.

.144 —1.90 | Unpopular people are particularly apt to become social or
religious reformers as a compensation. 

.14 —1.90 | Training in the estimation of short lines results in no improve-
ment in estimating longer lines.

.12 —1.95 | Idiocy is the lowest grade of feeblemindedness.

.01 —2.21 | The best pupil in the first grade spells as well as the poorest
pupil in the eighth grade.

.001 —2.24 | There is no definite dividing line between feeblemindedness
and normal intelligence.

— .07 —2.40 | Mental defectives show a tendency to drift together and form
relatively isolated intermarrying groups.

— .26 —3.14 | The school tends to promote children by age rather than by
ability.

— .29 —3.21 | About two per cent of the child population of a community
is too defective mentally to profit from instruction in the
public schools.
RETESTS AFTER TEN YEARS¹


IRVING LORGE
With the assistance of Hyman Brandt and the Staff of the Division of Psychology,
Institute of Educational Research,² Teachers College, Columbia University

Between December 1921 and November 1922 over two thousand
children were tested with tests of general intelligence, clerical capacity,
and mechanical adroitness. These tests were administered as prognosis
instruments in connection with a study of vocational success in
relation to abilities shown at or near age fourteen. Of this number,
approximately half were boys, and it is with retests of them that this
paper is concerned.

The boys were sampled in two ways:

1. The Boys Age Group were all the boys of age 13.0 to 15.0 in a
public elementary school in New York City. The school was located
in a section of low economic status and the children attending the
school were of foreign parentage, primarily of South European stock.

2. The Boys Grade Group included a fairly representative sampling
of the second term of the eighth grade of elementary schools in Manhattan.

The Boys Age Group was tested in December 1921 and the Boys
Grade Group in November 1922 with the following tests among
others.

1. Thorndike-McCall Reading Scale.
2. I. E. R. Arithmetic.
3. Stenquist Assembly Test.
4. I. E. R. General Clerical Test.

In addition the Boys Grade Group was measured for height and
weight. The testing was done by trained examiners. Each boy who
was well enough to be in school on the day of test was examined.

After ten years (during which time all the individuals were
being followed up for work career records) these boys were asked to
cooperate with the Inquiry by coming to Teachers College for
reexamination. The matter was entirely voluntary for each individual.
About one-fifth of the entire group came for the retests, which were
arranged for three weeks, one in November 1932, one in December
1932, and the third in March 1933. The boys were retested with the
same tests that they had taken more than ten years before.

1 This investigation is part of a study made possible by grants from the
Carnegie Corporation and from the Commonwealth Fund.

2 The authors wish to express their appreciation for the cooperation of Dr.
Ella Woodyard, Mr. Edward G. Stephany, Mrs. Zaida F. Metcalfe, Miss Eleanor
Robinson, and Miss Laura Buchman in the retest program.

The selection bias of the volunteers can be measured in two ways:
by the differences in means and standard deviations of the total group
and of the volunteers in the 1921-1922 tests, and by the χ² between the
distribution of the original group scores and that of the volunteer
retest group. Table I compares the means and standard deviations of
the original Grade Group, plus the boys of the Age Group who were in
8B at the time of the 1921-1922 test, with the statistics for the entire
volunteer group, regardless of whether they are from the Age or Grade
Group.¹

TABLE I. THE MEANS AND STANDARD DEVIATIONS OF THE VARIOUS TESTS FOR
THE ENTIRE GROUP IN 1921-1922 AND OF THE VOLUNTEERS ON THE
SAME TESTS AT THE SAME TIME

                     Thorndike-McCall      I. E. R.              Stenquist
                     reading scale         arithmetic            assembly test
                     n    Mean    SD       n    Mean    SD       n    Mean    SD
Original group....   862  52.84   8.10     863  10.00   2.36     856  41.53   20.99
The retest group..   163  53.12   8.51     164  10.19   2.74     163  42.49   22.55

                     I. E. R. general
                     clerical test         Height                Weight
                     n    Mean    SD       n    Mean    SD       n    Mean    SD
Original group....   851  44.53   11.07    700  62.08   3.78     700  105.33  21.26
The retest group..   160  42.84   12.51    132  61.36   3.77     132   99.52  21.77

Three of the means of the retest group are greater, and three are
less, than the corresponding means of the Boys Grade Group. In
no instance is the difference between the groups statistically significant.
The χ² test for comparing the correspondence of two distributions was
applied to the Boys Grade Group distributions of the original group
and volunteer group scores. The values of P corresponding to the χ²
obtained on each test were:





1 Of the Age Group but thirty-nine volunteered for the retests of which volun- 
teers eight were sampled in the same way as the Grade Group. 
Thorndike-McCall reading scale............................ .45
I. E. R. arithmetic....................................... .25
Stenquist assembly test................................... .86
I. E. R. general clerical test............................ .25
Height.................................................... .27
Weight.................................................... .31


All the values of P lie between .2 and .9. There is no reason to believe
that the volunteer group is any different from the original group of
which it was a part.¹ Furthermore, the χ² derived from each
comparison may be summed to give a P for all the comparisons that
were made. The P for the summed χ² is obtained by the formula
$\sqrt{2\chi^2} - \sqrt{2n - 1}$ when n is in excess of thirty. The value of P
for all the comparisons is .25. In essence, there is no reason to believe
that the volunteer group is different from the group of which it was a
part.
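
The large-sample formula cited from Fisher can be sketched in a few lines. The sketch below assumes that n denotes the degrees of freedom of the summed χ² and that the resulting quantity is referred to a unit normal curve to obtain an upper-tail P; the function name and the numerical inputs are hypothetical and are not the values used in this study.

```python
import math


def fisher_chi2_upper_p(chi2_sum: float, df: int) -> float:
    """Approximate upper-tail P of a summed chi-square with df > 30.

    Uses the normal deviate z = sqrt(2 * chi2) - sqrt(2 * df - 1).
    """
    z = math.sqrt(2.0 * chi2_sum) - math.sqrt(2.0 * df - 1.0)
    # Upper-tail area of the unit normal curve beyond z.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical inputs, for illustration only.
example_p = fisher_chi2_upper_p(chi2_sum=55.0, df=50)
print(round(example_p, 3))
```

A summed χ² close to its degrees of freedom yields a P near .5; larger sums yield smaller values of P.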

The correlation coefficients between the 1921-1922 test and the
1932-1933 retest were:


Thorndike-McCall reading scale.................. .57 (n = 163)
I. E. R. arithmetic............................. .60 (n = 164)
Stenquist assembly test......................... .66 (n = 163)
I. E. R. general clerical test.................. .63 (n = 160)
Height.......................................... .47 (n = 132)
Weight.......................................... .63 (n = 132)


The six coefficients can be interpreted only in light of their
reliability coefficients. The reliabilities of the various tests in 1921-1922
were as follows:²


Thorndike-McCall reading scale.
    Age Group............................... r = .80 to .90
    Grade Group............................. r = .70 to .80
I. E. R. arithmetic.
    Age Group............................... r = .75 to .85
    Grade Group............................. r = .70 to .80
Stenquist assembly test.
    Age Group............................... r = .70
    Grade Group............................. r = .60
I. E. R. general clerical test.
    Age Group............................... r = .85 to .92
    Grade Group............................. r = .80 to .87

1 Fisher, R. A.: “Statistical Methods for Research Workers.” Third edition,
1930, p. 77.

2 Thorndike et al.: “Prediction of Vocational Success.” Commonwealth
Fund, Appendix XIII.
Since the retest group was largely a Grade Group, the estimated
reliabilities will be less than those of the Age Group. It is assumed that
the reliability (retest with time interval negligible) for each test is as
follows:


Thorndike-McCall reading scale............................ .77
I. E. R. arithmetic....................................... .76
Stenquist assembly test................................... .62
I. E. R. general clerical test............................ .84


After ten years the retest coefficients have changed to .57, .60, .66 
and .63 respectively. The only test that did not lose in the retest was 
the Stenquist Assembly. This may be due to an underestimation of 
its initial reliability, or to the fact that its unreliability was so great 
that its retest reliability is a function of the reliability of other traits, 
or of chance. 

Woodyard,¹ basing her results on retest coefficients up to one year,
concluded: “The lapse of time between tests is one factor in the
variability of an individual’s performance. This factor, however, is of
small importance in estimating the total variability of an individual.”
The differences between reliability and retest coefficients that we have
found are not in agreement with this conclusion.

The effect of time upon retest coefficients has rarely been a special
topic in educational measurements. The principal focus of attention
has been the sub-problem of the constancy of the IQ, especially as
derived from the Stanford-Binet. Gates and LaSalle,² however,
report retest coefficients upon various intelligence and educational
tests after zero, four, eight, twelve, sixteen and twenty months.
“The subjects utilized were pupils mainly from grades III, IV, V,
and VI . . . about seventy-five in all. They were given the Stanford-
Binet once in each year, mainly during the first semester; the National
Intelligence Test once in October of each year, and a battery of tests
in reading comprehension (Thorndike-McCall); reading rate (two
forms of the Courtis or Burgess or one of each); arithmetic (all four
forms of the Woody) and spelling (sixty words from the Ayres Scale).”

The correlations of the first test with the retests four, eight, twelve,
sixteen and twenty months later are as follows:





1 Woodyard, Ella: “The Effect of Time Upon Variability.” Teachers College,
Columbia University, Bureau of Publications, 1926.

2 Gates, A. I. and LaSalle, Jessie: The relative predictive values of certain
intelligence and educational achievement upon intelligence tests scores. Journal
of Educational Psychology, 1923, pp. 517-539.
                                        Retest after months
                                     4     8     12    16    20
Reading comprehension.............. 7 | .78 | .70 | .76 75 
ES SEE ipl Se aan EY oe .80 .80 .82 .78 .74 
RS Ska poe neediest eaaen .91 . 86 91 .89 .84 
2  SEROVISSS HABE eee te . oe ce aor 48 




















Through the courtesy of Mrs. Zaida F. Metcalfe, the follow- 
ing retest coefficients after five months for fifty-three girls may be 
reported: 


sin as bine Cia. ig bind w web baw suas ewe bbs .73 
Thorndike-McCall + arithmetic............................ .63 
da aS anlcne so baal Mae nebo} oa .63 


There can be no doubt that the retest coefficients are diminished
with time. Mr. Robert Thorndike, using the data of thirteen
experimenters with retests of the Stanford-Binet, thirty-six correlations in
all, found that $z = 1.415 - .00916t$, where z is the Fisher
transformation for correlation coefficients,
$z = \tfrac{1}{2}\{\log_e(1 + r) - \log_e(1 - r)\}$,
and t is the time in months between test and retest.¹ The table
Thorndike reports shows the relation between retest coefficients on
the Stanford-Binet and time, t in months:


   t                                                          r

   0....................................................... .889
  10....................................................... .868
  20....................................................... .843
  30....................................................... .814
  40....................................................... .781
  50....................................................... .743
  60....................................................... .698
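
The fitted line can be turned back into a predicted retest coefficient by inverting the Fisher transformation, since r is the hyperbolic tangent of z. The following is a minimal sketch under that identity; the function name and the choice of ten-month steps are illustrative assumptions, and only the two constants come from the formula quoted above.

```python
import math


def predicted_retest_r(months: float) -> float:
    """Retest correlation implied by z = 1.415 - .00916 t."""
    z = 1.415 - 0.00916 * months
    # Inverse of z = 0.5 * (log(1 + r) - log(1 - r)) is r = tanh(z).
    return math.tanh(z)


# Evaluate the line at ten-month steps (an illustrative choice).
for months in range(0, 61, 10):
    print(months, round(predicted_retest_r(months), 3))
```

Because the line is fitted in z, equal steps in time produce progressively larger drops in r.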


In general, therefore, time diminishes the predictive value of a test
for itself. This may be attributed to several causes: the initial
reliability, growth, environmental factors and time, and the limitations
of the test itself. The probability is that the limitation of the
test itself is the primary factor in the reduction of the retest coefficient.
Tests are designed for a limited age or grade range. Whenever tests
are applied outside the designed range the retest or reliability
coefficient must of necessity be reduced.

1 Thorndike, R. L.: The effect of the interval between test and retest on the
constancy of the IQ. Journal of Educational Psychology, Vol. XXIV, 1933,
pp. 543-549.

It is true that the retest coefficients when corrected for attenuation 
indicate that the same traits or abilities are being measured ten years 
later as in the original tests. The value of the coefficients, however, 
deviates sufficiently from 1.00 to bring attention anew to the task of 
making better tests for the purpose of prediction of status at some 
future date. The corrected correlations between tests ten years apart 
would be (assuming the same reliability, at each testing) .74 for 
Thorndike-McCall, .79 for I. E. R. Arithmetic, 1.06 for Stenquist and 
.75 for the I. E. R. General Clerical. 
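
The corrected values can be checked with the usual correction for attenuation, in which the observed correlation is divided by the square root of the product of the two reliabilities. The sketch below assumes, as the text does, the same reliability at each testing, and uses the retest coefficients and assumed reliabilities given earlier (.77, .76, .62, .84); the function and variable names are illustrative.

```python
import math


def correct_for_attenuation(r_observed, rel_first, rel_second):
    """Divide the observed correlation by sqrt(rel_first * rel_second)."""
    return r_observed / math.sqrt(rel_first * rel_second)


# (ten-year retest coefficient, assumed reliability) for each test.
tests = {
    "Thorndike-McCall": (0.57, 0.77),
    "I. E. R. Arithmetic": (0.60, 0.76),
    "Stenquist": (0.66, 0.62),
    "I. E. R. General Clerical": (0.63, 0.84),
}

corrected = {
    name: round(correct_for_attenuation(r, rel, rel), 2)
    for name, (r, rel) in tests.items()
}
print(corrected)
```

A corrected value above 1.00, as for the Stenquist, arises whenever the retest coefficient exceeds the assumed reliability.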

Implicit in the guidance program, or in any prediction, are the
assumptions that standardized tests are valid and reliable. Many
guidance experts seem to believe that a standard test score as a basis
of prediction will indicate the relative standing of the individual as
accurately the next day, a month, a year, or even ten years later.
They assume that the correlation between a score and an end criterion
some n years away is practically as large as that between a test score
taken at the expiration of the n years and the same end-point criterion.
This is tantamount to assuming that the correlation between the original
test and the retest n years later is high. This is not true of many of the
standardized tests that have been developed to date.

Tests for guidance must consider the retest reliability over time.
Before guidance can make further progress, it must develop reliable
instruments.

THE FACTOR THEORY AND ITS TROUBLES:
IV. UNIQUENESS OF G


C. SPEARMAN 


1. IDENTIFICATION OF G BY REFERENCE VALUES 


In previous numbers of this Journal we have discussed several
of the troubles by which the theory of Two Factors has from time
to time been seriously disturbed. But in all these cases, the difficulty
was as to whether or not the theory had been adequately corroborated
by observation. Let us now turn to a second great group of troubles;
these involve the question as to whether or not the factors, even if
corroborated by observation, possess the virtue of being
“determinate”; or, as it is sometimes put, “unique.”¹

In perhaps its most acute form, the question may be posed as
follows. Suppose that each of two different sets of variables has
zero tetrads and so can be analysed into terms of a general factor and
independent specific factors. Can the two general factors then be
assumed to be one and the same? As regards this question, it would
seem that two diametrically opposed attitudes have been adopted
(and even at times by the same writer!). Some critics strenuously
insist that the two general factors may easily be different, and that
therefore all work based on the assumption of their sameness must
be inexorably rejected. Here are some criticisms on this basis.

“When a g is found, it may or may not be the g which another found yesterday.”
“The individual has as many g’s as you administer tests.” “Unless the same g
can be found in all, the concept is as useless as an I.Q. which must be restricted
to the Terman revision.”


But other critics appear to entertain small anxiety on the matter; they
adopt this assumption without admonishing the reader that it is
precarious. Kelley, for instance, seems to dispose of it in the following
few words:


We will assume that the general factor found in every set of four variables
taken from the nine is the same throughout.²


The present author, for his part, would suggest a middle course. He
would not offhand either reject or adopt the sameness or uniqueness,
but rather would seek to ascertain definitely the conditions upon
which this uniqueness depends.

1 Unfortunately, writers use this word in different senses. In our previous
articles, we found it means anything that only occurs on one occasion. Here, it
is anything that takes only one value.

2 Crossroads in the Mind of Man, p. 101.

Now, one condition of uniqueness has not only been asserted by 
my collaborators and myself, but has been so extensively employed 
by us that our whole psychological system may be said to have been 
built upon it. We have maintained that any number of general 
factors must necessarily be the same when all the sets have zero 
tetrads and every pair of sets has at least two variables in common. By 
the aid of such common variables—entitled reference values—the 
results of very numerous and diversified experiments have been 
integrated by us into one unitary edifice. Evidence for this thesis of 
ours may be found in “The Abilities of Man” (pp. 223 and xxi-xxiii).
But since the proof seems to have been little noticed, it will here be 
repeated more explicitly (see Appendix I). 
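
The notion of “zero tetrads” can be illustrated numerically. The sketch below assumes the simplest hierarchical case, in which each correlation is the product of the two variables’ correlations with a single general factor; the loadings are hypothetical, and the function names are introduced here only for illustration.

```python
from itertools import combinations


def one_factor_correlations(loadings):
    """Correlation matrix in which r_ij = a_i * a_j for i != j."""
    n = len(loadings)
    return [
        [1.0 if i == j else loadings[i] * loadings[j] for j in range(n)]
        for i in range(n)
    ]


def tetrad_differences(corr):
    """All three tetrad differences for every set of four variables."""
    diffs = []
    for a, b, c, d in combinations(range(len(corr)), 4):
        diffs.append(corr[a][b] * corr[c][d] - corr[a][c] * corr[b][d])
        diffs.append(corr[a][b] * corr[c][d] - corr[a][d] * corr[b][c])
        diffs.append(corr[a][c] * corr[b][d] - corr[a][d] * corr[b][c])
    return diffs


# Hypothetical correlations of five tests with the general factor.
corr = one_factor_correlations([0.9, 0.8, 0.7, 0.6, 0.5])
largest = max(abs(t) for t in tetrad_differences(corr))
print(largest)
```

With a single general factor every tetrad difference vanishes apart from rounding; altering one correlation in the matrix makes some of them depart from zero.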


2. CONSERVATION OF g ON “TRANSFORMATION”


Another conceivable condition of uniqueness has been brought
to light by the exceptionally fundamental work of E. B. Wilson.
Suppose that any set of variables is “hierarchical”; that is to say, has
zero tetrads, so that each variable can be analysed into terms of a
general factor and an independent specific one. Then from these
primary variables, and in an infinity of ways, a second set can be
derived by what is called linear transformation; here, each of the new
variables is made up of the primary ones variously weighted. Thus,
if primarily we have Nn scores of N individuals, x, y, . . . z, at n
tests, $p_1, p_2, \ldots, p_n$, we may combine these scores into n new sets
for each individual (call them $d_1, d_2, \ldots, d_n$) so that we get the
following:

$$
\begin{aligned}
d_1 &= W_{11}p_1 + W_{12}p_2 + \cdots + W_{1n}p_n \\
d_2 &= W_{21}p_1 + W_{22}p_2 + \cdots + W_{2n}p_n \\
&\;\;\vdots \\
d_n &= W_{n1}p_1 + W_{n2}p_2 + \cdots + W_{nn}p_n
\end{aligned}
\tag{A}
$$

with $n^2$ weights, $W_{ij}$. These new derived scores, the d’s, Wilson says,


contain all the information of the old scores because they can be solved for the
old scores, p, but the information is differently assembled.¹
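
Wilson’s remark that the derived scores can be solved for the old ones may be sketched for the case n = 2. The weights and scores below are hypothetical, and the sketch assumes the weights are non-singular, which is what makes the solution possible; one weight is deliberately negative, as the transformation permits.

```python
def apply_weights(weights, scores):
    """Derived scores d_i = sum over j of W_ij * p_j for one individual."""
    return [sum(w * p for w, p in zip(row, scores)) for row in weights]


def invert_2x2(weights):
    """Inverse of a 2 x 2 weight matrix; fails if the weights are singular."""
    (a, b), (c, d) = weights
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular weights: old scores cannot be recovered")
    return [[d / det, -b / det], [-c / det, a / det]]


weights = [[0.6, 0.4], [0.7, -0.3]]   # hypothetical W, one weight negative
primary = [1.2, -0.5]                 # hypothetical scores p_1, p_2
derived = apply_weights(weights, primary)
recovered = apply_weights(invert_2x2(weights), derived)
print(derived, recovered)
```

The recovered scores agree with the primary ones up to rounding, which is the sense in which no information is lost.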


But this transformation he shows to involve several vital problems.
What correlations will the new scores have with one another; in
particular, will they, like the primary ones, be hierarchical? He
answers that they may or may not be so. Moreover, even if they
are so, he goes on, the new g may have any value, regardless of
that of the primary one. And from these facts he makes an alarming
deduction:


What does this leave of the concept of the intelligence of an individual x as
measured by g? Apparently only that it is relative to the set up, which is the
obvious proposition that I set out to prove.

1 Science, Vol. LXVII, 1928, p. 247.


And indeed, the situation appears to be parlous enough. Let any
psychologist devise any set of intelligence tests that are irreproachably
hierarchical; let him apply them in a school and, on the strength of the
scores, speed the “geniuses” up to higher grades, whilst the “morons”
are packed off to carpentry. Then, if things are as bad as they seem,
another psychologist can come along and take the very same tests,
even the very same scores. Without altering any of the “information”
which they contain, and without abating their character of
hierarchy, he can, merely by rearranging them, make anyone who
before was a genius now appear a moron and vice versa!

Faced by this prospect of g being only “relative to the set up,”
the present writer sought for any circumstance whereby it might be
averted. Such a saving circumstance, he thought, might perhaps be
found in the fact that Wilson’s transformation permitted any of the
weights to have negative values, so that the primary scores could
in the derived ones not only be added together, but also subtracted
from one another.

Now, against the positive values and the addition there seemed 
nothing to urge; no objection could be raised to the conception of 
a complex ability composed of more elementary abilities or ability- 
factors. This conception is supported by universal practice of mental 
testing; everywhere scores in tests are obtained by adding together 
the elementary scores got in the sub-tests. 

But the negative weighting and consequent subtracting appeared 
in a much less favorable light. Common practice affords little 
precedent for the subtraction of the score made in one test from that 
made in another. Such a negative weight does not merely mean that 
the score bears the negative sign (as occurs, for instance, when an 
ability is measured by the fewness of mistakes made). It means 
rather that the ability when combined with other abilities has the opposite 
sign to what it has when not so combined. If, for instance, a person 
gets a high mark for memory of faces, this will only pull him down 
if he happens to be also tested for memory of names! Moreover the
absurdity remains when we consider, not the entire score made at a
test, but the two separate factors general and specific. It seems quite
irrational that a person’s high ability of either general or specific
kind as reckoned from a single test-score should count against him
in reckoning general and specific abilities from this score in conjunction
with others. An ability score does not by the mere concomitance
of other scores become converted into a score of disability.

But to meet this plea of mine against negative weights and the
consequent subtraction, Wilson with his characteristic thoroughness went
into the question further still. Putting for simplicity of illustration n = 2,
he showed that, in order to keep g the same for the derived as for the
primary variables (“to conserve g,” as he expresses it), the following
equation must be satisfied; this, he said, becomes impossible if all
the W’s are positive.


$$
0 = W_{11}W_{21}\left(1 - r_{1g}^{2}\right) + W_{12}W_{22}\left(1 - r_{2g}^{2}\right)
\tag{B}
$$
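
Why positive weights cannot satisfy such a condition may be seen from a small numerical sketch. It assumes standardized tests of the form p_i = r_ig g + sqrt(1 - r_ig²) s_i with independent specifics, so that the specific parts of d_1 and d_2 covary by W_11 W_21 (1 - r_1g²) + W_12 W_22 (1 - r_2g²); this reading of the condition, the correlations, and the weights are all assumptions made for illustration.

```python
def specific_covariance(weights, r1g, r2g):
    """Covariance between the specific parts of d_1 and d_2 when n = 2."""
    (w11, w12), (w21, w22) = weights
    return w11 * w21 * (1 - r1g ** 2) + w12 * w22 * (1 - r2g ** 2)


def balancing_w22(w11, w12, w21, r1g, r2g):
    """The value of W_22 that makes the specific covariance vanish."""
    return -w11 * w21 * (1 - r1g ** 2) / (w12 * (1 - r2g ** 2))


r1g, r2g = 0.8, 0.6                      # hypothetical correlations with g
all_positive = [[0.5, 0.5], [0.3, 0.7]]  # every weight positive
w22 = balancing_w22(0.5, 0.5, 0.3, r1g, r2g)
balanced = [[0.5, 0.5], [0.3, w22]]      # W_22 chosen to balance

print(specific_covariance(all_positive, r1g, r2g))
print(w22, specific_covariance(balanced, r1g, r2g))
```

With every weight positive both terms are positive and cannot cancel; the balancing value of W_22 comes out negative.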


Following up this result he writes: 


Spearman’s psychological postulate that we may not subtract scores prevents
us from making those linear combinations of the tests which might conserve g
and hence, relying on his postulate that g is definite we come to the theorem that
we cannot add scores either. We may make no combinations of tests. Is this a
paradox? I do not know. And if it be a paradox how are we to get out of it?¹


By this new mathematical theorem the confusion seemed to become
worse confounded; the whole topic of positive and negative weights
in compounding abilities stood in more urgent need than ever of
fundamental clarification. Current literature, however, proffered
disappointingly little assistance. Even the excellent work of Irwin
only says that “if we insisted on positive weights,” then “a further
limitation would occur.” What further limitation?²

Being thus thrown on our own resources, let us set out with the
following system of equations, which is more general than that of
Wilson, but becomes identical with it in the special case that n and m
are equal.

$$
\begin{aligned}
d_1 &= W_{11}p_1 + \cdots + W_{1n}p_n, \\
d_2 &= W_{21}p_1 + \cdots + W_{2n}p_n, \\
&\;\;\vdots \\
d_m &= W_{m1}p_1 + \cdots + W_{mn}p_n.
\end{aligned}
\tag{C}
$$

1 This Journal, Vol. XX, 1929, p. 218.
2 Brit. J. Psychol., Vol. XXIII, 1933, p. 375.
Now, if these d’s are, like the p’s, hierarchical, and whether or not
the derived g is the same as the primary one, we can from (C) deduce
for every pair of d’s, say $d_1$ and $d_2$, the following equation (see
Appendix II).

$$
\begin{aligned}
0 ={}& (W_{11}E_{11} + \cdots + W_{1n}E_{1n})(W_{21}E_{21} + \cdots + W_{2n}E_{2n}) \\
&+ W_{11}W_{21}F_{11}F_{21} + \cdots + W_{1n}W_{2n}F_{1n}F_{2n}
\end{aligned}
\tag{D}
$$

where $W_{ij}$ denotes the weight given to the primary variable j in the
derived variable i, whilst the E’s and F’s are the respective weights
of the primary general and specific factors in the derived specific
factors.

In the first place let us not impose on this equation (D) any further 
condition, save only that m must equal n; this much is required to 
make the case one of the “transformation” discussed by Wilson. 
We may note at once that the number of values at our disposal (the 
W’s, E’s and F’s) amounts to $3n^2$. But the number of independent
conditions to be satisfied is only n(n — 3)/2 (see Appendix II). 
Hence, satisfaction of (D) is possible with degrees of freedom amount- 
ing to the difference between these two numbers; that is, n(5n — 3)/2. 
Thus the theoretically possible forms of satisfaction amount to an
infinity of high order. Such a conclusion, of course, fully corroborates
Wilson. 

Still, immense as is this number of forms in which (D) admits of 
being satisfied, nevertheless we see from the preceding figures that 
it vanishes in comparison with the number of forms which do not 
satisfy it. Accordingly, although in pure mathematics the satisfac- 
tion is without doubt possible, still in statistics the expectation of 
its occurrence is altogether negligible; to hope that by sheer dint of 
good luck all the positive W’s, E’s and F’s in (D) might exactly 
balance out all the negative ones, and further that this strange coinci- 
dence might recur in every one of the n(n — 1)/2 different equations 
of this type, would indeed be futile. Moreover it is very important 
to note that this statistical impossibility holds, not only when the 
variables in a set are put together at random, but even when they 
are deliberately selected from all actual observations. For the number 
of observations must at any rate be finite, whereas the chance of the 
balancings is infinitesimal. If such an array of balancings ever did
happen, they could be explained neither by chance, nor even by
deliberate selection, but only by being all brought under some single
principle. And no such unifying principle would seem to have as yet
been suggested.

In the two preceding paragraphs we have considered the results 
of (D) when this is subject to no condition, save the necessary one 
that m = n. Now let us bring in the further condition which has 
been so much discussed, that no influential number of the W’s, E’s or
F’s should be negative. Hereupon, the prospect of satisfying (D)
becomes more hopeless than ever; for this time, obviously, all balancing
is impossible not only in statistics but even in the most rigorous
mathematics. There might seem to remain yet a way along which
escape from the difficulty could be attempted; the satisfaction of (D)
might be sought by taking the variables at our disposal (W’s, E’s, 
and F’s) to have among them an overwhelming number of zeros. 
And seemingly we could after this fashion eliminate at any rate the 
lower part of (D); for this part looks as if it could be made to vanish 
by leaving in every column of (A) only a single W positive, all the rest 
being zero. But even along this line we are baffled; for it would make 
m less than n (see Appendix III), which conflicts with our fundamental 
assumption. 

Let us pass on to still another condition; the most important of 
all for our present purposes; let us consider the effect of conserving g. 
In current literature, such a conservation seems to have been supposed 
to render the transformation much harder. But our (D) shows the 
contrary; for on conserving the g, all the E’s become zero, so that at 
any rate the upper and usually far larger part of (D) is eliminated.
But then, as may easily be seen, the F’s cannot also be zero, so that 
once more the fulfilment of (D) is precluded. 

On the whole, then, whenever negative weights are not allowed 
in large enough measure to exactly balance the positive ones, then 
transformation from one hierarchy to another becomes absolutely impossible;
this holds not only in statistics, but even in pure mathematics; it is
true whether or not g be conserved. Consequently when Irwin describes 
the insistence on positive weights as a “further limitation” he would 
seem to understate the case. And as for the theorem of Wilson, 
startling as this seemed even to himself, we have here entirely verified 
it; along with him, we can say that, if there is no subtraction of scores, 
neither can there be any addition of them. 

After all this about transformation, however, there still remain
to be considered those systems of linear functions which do not belong
to this category; in particular, those where m is less than n. Previous
writers on g would appear to have left such systems without mention,
as if not belonging to their study. But unfortunately they have often
used language which conveys the impression of generalizing their
theorems to these systems no less than to the transformative ones.
And anyhow, the case of m being less than n would seem quite
important enough to merit explicit consideration.

Now, one result of this case is that here the lower part of (D) 
does admit of reduction to zero (see Appendix III). And there is
no reason why this elimination of the lower part by making m less 
than n should not be combined with the elimination of the upper 
part by conserving g. In this manner we do at last come upon a 
case where addition is introduced although subtraction is forbidden. 
And behold it is just the case which occurs in common practice! 
Testers do continually combine many part-tests into a single total 
test; but they do not wittingly put one and the same part-test into 
more than one total test. Herewith the “paradox” of Wilson would 
seem to be solved. 

On the whole, then, our scrutiny of the “transformation” so profoundly
conceived and handled by Wilson has, in the sphere of abilities
at any rate, had a happy issue. We need no longer fear that, when 
any psychologist has measured g, a second one may come along and, 
merely by reshuffling the very same scores, produce another g unlike 
the former. Such an upsetting shuffle has here been shown to be 
theoretically impossible—not to mention that actually not the hardiest 
psychologist has ever effected or even attempted it. 

But let us not fail to append the warning that in other spheres 
(even psychological) transformation may perhaps play a very different 
rôle.


3. EXACTITUDE OF MEASUREMENT 


To the preceding grounds on which g has been charged with lack
of uniqueness must be added another one of considerable importance:
the discovery that every measurement of g has an indeterminate
component. In symbols, $g = r + i$, where only r is determinate, not i.¹

The chief difficulties raised by this fact, however, were eliminated
from the very beginning by proof that the indeterminate component
could experimentally be made as small as desired. Recently, the
matter has been treated in a more fundamental manner by introducing
a distinction between indeterminateness of two unlike kinds. On
the one hand, there is what may be regarded as a mere lack of
exactness; this, of course, affects in greater or less degree all measurements
whatever; but it does not affect the object to be measured. For
instance, the distance between two points may be measured several
times with results that more or less change; still there may be no
reason to suppose any change in the distance itself. The other kind
of indeterminateness is more basal and does affect the object to be
measured. As an instance, take the measuring of the length of a person’s
head; here variation may occur, not only in making the actual
measurement, but even in conceiving what constitutes the “length” of a head;
this length, then, may reasonably enough be charged with lack of
uniqueness.

1 Spearman: Proc. Roy. Soc., A, CI, 1922, pp. 279-281.
Wilson: Proc. Nat. Acad. Sciences, Vol. XIV, No. 3, 1928.

Now, the indeterminateness with which we are here concerned 
has turned out to be the former of the two preceding kinds; that is 
to say, it is only a defect of exactness; in fact, it is nothing else than 
the ordinary error of a regression equation. It leaves intact the “true”
g, which does admit of definition in such wise as to become perfectly
determinate.¹

More reasonable would appear to be the charge of indeterminateness 
of g on the ground that the measurement of any person will vary 
according to the population in which he happens to be included. For 
the value of any individual’s g is obtained in the following manner: 


$$g_x = r_{tg}\, t_x$$


where $t_x$ is the score of the individual $x$ at the test $t$, $r_{tg}$ is the
correlation of this test with the true value, and $g_x$ is the valuation obtained
for the g of the individual. We can at once see that $r_{tg}$, and therefore
$g_x$, can present considerable variations for different populations in
which the individual may chance to find himself.
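The dependence of the estimate on the population can be illustrated with a minimal sketch. It assumes the regression form given above, with the test score in standard measure; the function name, the score, and the three correlations are hypothetical values chosen only for illustration. As a simplification, the standard score is held fixed, although in practice it too would shift from one population to another.

```python
def estimate_g(test_score, r_tg):
    """Regression estimate g_x = r_tg * t_x (both in standard measure)."""
    if not -1.0 <= r_tg <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r_tg * test_score


# Hypothetical individual: one and the same standard score on the test.
t_x = 1.5

# Hypothetical correlations of the test with g in three different populations.
estimates = {r_tg: estimate_g(t_x, r_tg) for r_tg in (0.6, 0.8, 0.9)}

for r_tg, g_x in estimates.items():
    print(f"r_tg = {r_tg:.1f}  ->  estimated g_x = {g_x:.2f}")
```

The estimate moves with the correlation, while nothing in the sketch alters the individual himself.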

But here the trouble seems to lie in nothing more than an equivoca- 
tion; namely, between the true value of g and the most advantageous 
estimate to be obtained by statistical devices. The preceding $g_x$
is only the estimate; this does indeed vary according to the information
on which it may be based. But all the time the true value remains
immutable.

To conclude, g has been charged with lack of determinateness
simply and sheerly because of its error of measurement being very
large.





1 Spearman: The Uniqueness and the Exactness of “g.” Brit. J. Psych., 1933.
Now, the largeness of the error in most determinations of g was
already stressed by the present writer.¹ But recently, such an
authoritative mathematician as Piaggio has gone far beyond this; 
he says that even in the best work up to date the value obtained is 
so full of error as perhaps not to deserve the name of measurement at 
all.²

Without quibbling at the word “measurement,” let us turn to the
actual facts. We find that the degree of accuracy of a measurement
of g can be brought to expression in at least three different ways. 
There is its standard deviation of sampling (its “probable error” 
comes to the same thing). There is its variance of sampling (which 
is the square of the standard deviation). And there is its correlation 
with the true value. Let us see what these three expressions come to, 
for example, in the recent monumental work of Brown and Stephenson.
They are:

Standard deviation, as       Variance, as                 Correlation
per cent of obtained value   per cent of obtained value   with true value
25                           6                            .97











Now most people, it seems to me, would on looking at the first expression
think the measurement very bad; but if instead they saw the second
expression, they would find it quite tolerable; and if instead of either they
came upon the third expression, they would go on their way rejoicing.
The moral seems to be that to say whether an error is large or 
not is very much a subjective matter. All really depends on the 
purpose to which the obtained value is to be applied. And if this 
value correlates with the true one by as much as .97, then surely it 
will be useful for purposes in plenty. Anyway, it is incomparably 
better than any measurement of an IQ that I have ever met; in 
these, usually, the correlation of obtained with true values cannot be 
taken as more than about .70 to .80. Indeed, it may even surpass 
many modern physical measurements; for example, the determination
of the fundamental magnitude e has in some of the best recent work
shown a mean variation amounting to no less than 30 per cent.
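The three expressions quoted for Brown and Stephenson are arithmetically linked, as a short sketch shows. The source does not state the formula connecting them; the sketch assumes that the error variance is the square of the error standard deviation and that the correlation with the true value is the square root of one minus that variance fraction. The function name is hypothetical.

```python
import math


def accuracy_expressions(sd_fraction):
    """Turn an error S.D. (as a fraction) into the other two expressions.

    Assumption, not stated in the source: variance fraction = sd_fraction**2,
    and correlation with the true value = sqrt(1 - variance fraction).
    """
    variance_fraction = sd_fraction ** 2
    correlation = math.sqrt(1.0 - variance_fraction)
    return variance_fraction, correlation


variance, correlation = accuracy_expressions(0.25)
print(round(100 * variance, 2), round(correlation, 2))  # 6.25 0.97
```

Under this reading, a standard deviation of 25 per cent, a variance of about 6 per cent, and a correlation of .97 are one and the same fact stated three ways.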





1 The Abilities of Man, 1927, pp. xvii—xviii. 

2 Brit. J. Psych., Vol. XXIV, 1933. Also same author, at the meeting of The
Brit. Ass. Adv. Science, Leicester, 1933.

3 Brit. J. Psych., 1933.
To summarize the present article, our survey has failed to find anything
which justifies the charge against g of not being “unique.”
Most of the reasons assigned for such a charge have proved to be 
quite invalid. The remaining grounds urged have turned out to 
warrant no charge more serious than that of an inexactitude which is 
usually far exceeded in other psychological results, and sometimes 
even in physics. 

APPENDIX I. IDENTIFICATION OF G BY REFERENCE VALUES 











                   General factors
Variables      First set    Second set
$v_1$          $g_1$
$v_2$          $g_1$
$v_{n-1}$      $g_1$        $g_2$
$v_n$          $g_1$        $g_2$
$v_{n+1}$                   $g_2$
$v_{n+m}$                   $g_2$











Let there be two hierarchical sets of variables having as their
respective general factors $g_1$ and $g_2$. Since the first set is hierarchical,
all that is common to any two or more of them, say $v_{n-1}$ and $v_n$, will
be $g_1$. But again if these same two or more variables belong also
to the second set, then all that is common to them is $g_2$. Hence, $g_1$
and $g_2$ must be one and the same.


APPENDIX II. CONDITIONS OF HIERARCHY IN LINEAR FUNCTIONS 


As is well known, whenever any set of variables is “hierarchical” 
in the sense of having zero tetrads, there ensues the following system 
of equations. 


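$$
\begin{aligned}
p_1 &= r_{p_1 g_p}\, g_p + k_{p_1 g_p}\, s_{p_1} \\
p_2 &= r_{p_2 g_p}\, g_p + k_{p_2 g_p}\, s_{p_2} \\
&\;\;\vdots \\
p_n &= r_{p_n g_p}\, g_p + k_{p_n g_p}\, s_{p_n},
\end{aligned}
\tag{1}
$$


where the $p$’s are the said variables, whilst the $r$’s, $g$’s and $s$’s have
their customary meanings in the present reference, and furthermore
$k^2$ stands for $1 - r^2$.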
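The “zero tetrads” condition can be checked numerically in a minimal sketch. It assumes that the general and specific factors in (1) are mutually uncorrelated and in standard measure, so that the correlation between any two primary variables is the product of their correlations with the general factor. The loadings and function names are hypothetical.

```python
from itertools import combinations


def correlation(loadings, i, j):
    """Correlation of variables i and j under system (1): r_i * r_j (assumed)."""
    return loadings[i] * loadings[j]


def tetrad_differences(loadings):
    """All tetrad differences for every set of four variables."""
    diffs = []
    for a, b, c, d in combinations(range(len(loadings)), 4):
        r_ab_cd = correlation(loadings, a, b) * correlation(loadings, c, d)
        r_ac_bd = correlation(loadings, a, c) * correlation(loadings, b, d)
        r_ad_bc = correlation(loadings, a, d) * correlation(loadings, b, c)
        diffs.extend([r_ab_cd - r_ac_bd, r_ab_cd - r_ad_bc, r_ac_bd - r_ad_bc])
    return diffs


# Hypothetical correlations of five primary variables with the general factor.
example_loadings = [0.9, 0.8, 0.7, 0.6, 0.5]
largest = max(abs(t) for t in tetrad_differences(example_loadings))
print(f"largest absolute tetrad difference: {largest:.2e}")
```

Every tetrad difference vanishes up to rounding, which is the sense of “hierarchical” used in this appendix.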

From the preceding set of variables, which we will call “primary,”
let us derive another set which are linear functions of those preceding 
and so may be written as follows: 
$$
\begin{aligned}
d_1 &= W_{11}p_1 + \cdots + W_{1n}p_n \\
d_2 &= W_{21}p_1 + \cdots + W_{2n}p_n \\
&\;\;\vdots \\
d_m &= W_{m1}p_1 + \cdots + W_{mn}p_n,
\end{aligned}
\tag{2}
$$


where the d’s are the derived variables and the W’s are the coefficients 
or weights by which the primary variables are respectively multiplied. 
These d’s also we will assume to be hierarchical, and we will proceed 
to consider what conditions are thereby involved. 

To begin with, we can write the further equations 


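$$
\begin{aligned}
d_1 &= r_{d_1 g_d}\, g_d + k_{d_1 g_d}\, s_{d_1} \\
d_2 &= r_{d_2 g_d}\, g_d + k_{d_2 g_d}\, s_{d_2} \\
&\;\;\vdots \\
d_m &= r_{d_m g_d}\, g_d + k_{d_m g_d}\, s_{d_m}.
\end{aligned}
\tag{3}
$$
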
Substituting from (1) in (2) we get: 


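$$
\begin{aligned}
d_1 &= W_{11} r_{p_1 g_p} g_p + W_{11} k_{p_1 g_p} s_{p_1} + \cdots + W_{1n} r_{p_n g_p} g_p + W_{1n} k_{p_n g_p} s_{p_n} \\
d_2 &= W_{21} r_{p_1 g_p} g_p + W_{21} k_{p_1 g_p} s_{p_1} + \cdots + W_{2n} r_{p_n g_p} g_p + W_{2n} k_{p_n g_p} s_{p_n} \\
&\;\;\vdots
\end{aligned}
\tag{4}
$$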


Now, the left hand term on the right side of each equation in (3) 
can be regarded as made up of portions of the 2n terms on the right 
side of the corresponding equation in (4). Then each right hand term 
on the right side of each equation in (3) will be made up of the com- 
plementary portions of the terms on the right side of (4). For instance 
we shall have: 


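$$
\begin{aligned}
r_{d_1 g_d}\, g_d ={}& W_{11}R_{11}\, r_{p_1 g_p} g_p + \cdots + W_{1n}R_{1n}\, r_{p_n g_p} g_p \\
&+ W_{11}K_{11}\, k_{p_1 g_p} s_{p_1} + \cdots + W_{1n}K_{1n}\, k_{p_n g_p} s_{p_n},
\end{aligned}
\tag{5}
$$
and
$$
\begin{aligned}
k_{d_1 g_d}\, s_{d_1} ={}& W_{11}(1 - R_{11})\, r_{p_1 g_p} g_p + \cdots + W_{1n}(1 - R_{1n})\, r_{p_n g_p} g_p \\
&+ W_{11}(1 - K_{11})\, k_{p_1 g_p} s_{p_1} + \cdots + W_{1n}(1 - K_{1n})\, k_{p_n g_p} s_{p_n},
\end{aligned}
\tag{6}
$$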


where the R’s and the K’s indicate how much of each term on the right 
side of (4) has been taken for each term on the right side of (3). They
are taken here as of the nature of weights, or perhaps better “sub-weights.”
As $W_{11}$ is the weight of $p_1$ in $d_1$, so are $R_{11}$ and $K_{11}$ taken
to be the respective weights of $r_{p_1 g_p} g_p$ and $k_{p_1 g_p} s_{p_1}$ in $p_1$. Otherwise
expressed, we assume that, as in (4) $d$ is a linear function of the primary
elementary variables $g_p$ and the $s_p$’s, so the two parts of $d$ are linear
functions of these same elementary variables. Into the limitation
introduced by this assumption we shall not enter here. 
For brevity, let us write the right side of (6) as 


$$
(W_{11}E_{11} + \cdots + W_{1n}E_{1n})\, g_p + (W_{11}F_{11}\, s_{p_1} + \cdots + W_{1n}F_{1n}\, s_{p_n}).
\tag{7}
$$


If then we multiply this (7) by the analogous quantity for $d_2$, the
result will be some terms involving $g_p^2$ and each $s_p^2$ respectively together
with some terms involving products of all these variables with one 
another. On summing for all individuals and dividing by their 
number, all the terms involving products will disappear (for by 
assumption these variables are uncorrelated). There ensues 


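$$
\begin{aligned}
k_{d_1 g_d}\, k_{d_2 g_d}\, r_{s_{d_1} s_{d_2}} ={}& (W_{11}E_{11} + \cdots + W_{1n}E_{1n})(W_{21}E_{21} + \cdots + W_{2n}E_{2n})\, \sigma^2_{g_p} \\
&+ (W_{11}W_{21}F_{11}F_{21}\, \sigma^2_{s_{p_1}} + \cdots + W_{1n}W_{2n}F_{1n}F_{2n}\, \sigma^2_{s_{p_n}}).
\end{aligned}
\tag{8}
$$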


Finally, since by assumption the correlations between the d’s are 
hierarchical, the $s_{d_1}$ and $s_{d_2}$ must be uncorrelated. That is to say,
making as usual the sigmas equal to unity, we arrive at 


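$$
\begin{aligned}
0 ={}& (W_{11}E_{11} + \cdots + W_{1n}E_{1n})(W_{21}E_{21} + \cdots + W_{2n}E_{2n}) \\
&+ (W_{11}W_{21}F_{11}F_{21} + \cdots + W_{1n}W_{2n}F_{1n}F_{2n}).
\end{aligned}
\tag{9}
$$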


The number of such equations between the factors $s_d$ is obviously
$m(m - 1)/2$. All however come from the equations involving the
correlations between the entire $d$’s, which are also $m(m - 1)/2$ in
number, but can be expressed in terms of $m$ independent variables.¹
Hence the number of independent conditions imposed by all the
equations like (9) will be

$$\frac{m(m-1)}{2} - m = \frac{m(m-3)}{2}.$$
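The count of independent conditions can be tabulated with a small sketch. It follows the arithmetic of the text, the number of pairs of derived variables less the $m$ independent variables in which the correlations can be expressed; the function name is hypothetical.

```python
def independent_conditions(m):
    """Pairs of derived variables, m(m - 1)/2, less m independent variables."""
    pairs = m * (m - 1) // 2
    return pairs - m


# Small table for a few numbers of derived variables.
for m in range(3, 8):
    print(m, independent_conditions(m))
```

With three derived variables no condition is imposed; each further variable adds more conditions than the last.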


APPENDIX III. POSSIBILITY OF ADDITION WITHOUT SUBTRACTION 


Suppose that in the lower part on the right of (9) all the F’s are 
positive and none of the W’s negative. Then, in order that this part 
should reduce to zero, it is necessary that no column in (2) should 
contain more than a single positive W. But in that case there cannot
be more than n positive W’s in the whole table. On the other hand, in
order that there should be any addition, at least two W’s must be
positive in the same row, so that in the whole table there cannot be
less than m + 1 positive W’s. Hence addition is possible if m is
less than n; not otherwise. 
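The counting argument can be verified by brute force for small tables, as in the sketch below. It marks each weight only as positive (1) or absent (0), and it assumes, as the bound of m + 1 implies, that every derived variable carries at least one positive weight. The function name is hypothetical.

```python
from itertools import product


def addition_without_subtraction_possible(m, n):
    """Search all m-by-n patterns of positive (1) or absent (0) weights."""
    for cells in product((0, 1), repeat=m * n):
        rows = [cells[i * n:(i + 1) * n] for i in range(m)]
        # Assumed: every derived variable has at least one positive weight.
        every_row_used = all(sum(row) >= 1 for row in rows)
        # No column may hold more than a single positive weight.
        columns_ok = all(sum(row[j] for row in rows) <= 1 for j in range(n))
        # Addition requires two positive weights in some one row.
        some_addition = any(sum(row) >= 2 for row in rows)
        if every_row_used and columns_ok and some_addition:
            return True
    return False


for m in range(1, 4):
    for n in range(1, 5):
        print(m, n, addition_without_subtraction_possible(m, n))
```

For every table searched, a valid pattern exists exactly when m is less than n.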











1 Spearman: Proc. Roy. Soc., A, CI, 1922. 

































READING RATE AND COMPREHENSION ACCURACY 
AS DETERMINANTS OF READING TEST SCORES 


F. P. ROBINSON AND F. H. McCOLLOM 


University of Iowa 


The purpose of this study is to evaluate the relative importance 
of rate of reading and accuracy of comprehension in determining 
reading test scores. This is to be done by analyzing and comparing 
the reading abilities of good and poor readers so as to note which 
of these two variables most clearly differentiates high and low scorers 
on reading tests. 

These are semi-independent variables. In any one reader they 
tend to be inverse functions of each other with a particular relation- 
ship being maintained in each individual’s normal reading. However, 
when one individual is compared with another it is found that the
faster reader is usually the more accurate (see Table I). Also with
proper techniques it is possible to improve either function while the
other is being held constant. Thus it will be interesting to note
the effect of these as independent variables on reading and, in so far
as they are functions of each other, to determine the relationship
maintained between them by superior readers. 

The group of good readers consisted of thirty-seven college fresh- 
men who scored in the highest fifteen per cent on the Iowa Silent 
Reading Test battery while the group of poor readers consisted of 
thirty-three freshmen who scored in the lowest fifteen per cent on 
the same test. The analysis of the Paragraph Comprehension test 
of this battery into the number of questions attempted as a measure 
of rate and into the per cent of the questions attempted that were 
correct as a measure of comprehension accuracy showed that the good 
readers were superior to the poor readers in both variables. However, 
the critical ratios² indicate that the difference was greater in rate of
reading than in comprehension accuracy. These ratios were 15.11
and 6.86, respectively. (See Table I. In the first two columns, the
top number of each pair is the mean, the bottom number is the standard
deviation. The third column indicates the reliability of the
difference.)

1 Robinson, F. P.: “The Rôle of Eye Movements in Reading with an Evaluation
of Techniques for Their Improvement.” Univ. of Iowa Studies: Series on
Aims and Progress of Research, No. 39, 1933.

2 Critical ratio = $\dfrac{M_1 - M_2}{\text{S.E. of difference}}$

TABLE I. SHOWING THE GROUP SCORES AND THE RELIABILITY OF THE DIFFERENCE

Item                                   Good       Poor       Critical
                                       readers    readers    ratio

Measures of rate:
  I.S.R., I (questions attempted)      27.60      14.39      15.11
                                        1.82       4.72
  Van Waggenen “A” (time)              21.85      39.68       9.85
                                        4.41       9.50
  Van Waggenen Alpha (time)            18.47      37.74       9.99
                                        4.78       9.82

Measures of comp. accur.:
  I.S.R., I (per cent correct)         89.15      67.96       6.86
                                        6.09      16.77
  Van Waggenen “A” (score)             93.32      82.59       5.37
                                       10.18       6.35
  Van Waggenen Alpha (score)           94.51      83.27       5.83
                                       10.10       5.27
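The critical ratios for the Iowa Silent Reading Test can be rechecked from the means and standard deviations in Table I. The sketch below assumes the usual standard error of a difference between two independent means, the square root of the sum of each variance divided by its group size; the source gives only the ratio itself, so this formula and the function name are assumptions. The group sizes are the thirty-seven good and thirty-three poor readers described in the text.

```python
import math


def critical_ratio(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """(M1 - M2) divided by the standard error of the difference.

    Assumed S.E. of difference: sqrt(sd_1**2 / n_1 + sd_2**2 / n_2).
    """
    se_difference = math.sqrt(sd_1 ** 2 / n_1 + sd_2 ** 2 / n_2)
    return (mean_1 - mean_2) / se_difference


# Iowa Silent Reading Test, Part I: good readers (n = 37), poor readers (n = 33).
rate = critical_ratio(27.60, 1.82, 37, 14.39, 4.72, 33)
accuracy = critical_ratio(89.15, 6.09, 37, 67.96, 16.77, 33)

print(f"rate: {rate:.2f}   comprehension accuracy: {accuracy:.2f}")
```

Under this assumption the two ratios come out at roughly 15.1 and 6.9, in line with the 15.11 and 6.86 reported.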

Because the number of questions attempted in a time test cannot 
be accurately determined (some may be tried but left unmarked as 
too difficult), both groups were given two other reading tests which 
do not have time limits. These were the Van Waggenen Scales for 
English Literature, Forms A and Alpha. The former measures 
“the ability to comprehend what is read” and the latter, “the ability
to interpret what is read.”¹ The time taken to finish each test was
noted but the students were told that the primary purpose of the test 
was to determine how well they could comprehend what they read. 

The Van Waggenen tests have certain advantages for our problem. 
Their scores measure comprehension accuracy alone and have a high 
reliability (.94). Each test took an average of twenty minutes for 
superior readers and thirty-nine minutes for poor readers to finish 
and so gives a reliable measure of reading rate. Because Van Wag- 
genen has statistically evaluated the comprehension difficulty of 
each paragraph and arranged them as a power test, these tests approach 





1 Catalog of Standardized Tests for High School and College. Bloomington:
Public School Publishing Company, 1930.
the nearest of any on the market to measuring “depth of comprehension
or meaning.” Since the material used in these tests is selected
from textbooks and is presented in book-like form and since the tests 
have no time limit, they may be considered as good measures of the 
ability to comprehend class assignments. 

The comparison of good and poor readers in rate of reading and 
comprehension accuracy on the two Van Waggenen Scales shows that 
again the good readers were superior in both variables, but that 
again the greatest differences were in rate of reading (Table I). 

The relative importance of rate and accuracy in determining 
reading test scores can be further illustrated by indicating the per- 
centage of poor readers who did as well as superior readers in each 
of these variables. The percentages of poor readers who had com- 
prehension accuracy scores that fell within the range of that shown 
by superior readers on the Iowa Silent Reading Test, Part I, Van 
Waggenen Scale A and Scale Alpha were forty-two, one hundred, and 
ninety-four per cent, respectively. This suggests that a large majority
of poor readers have accuracy coefficients which a superior reader
might have, although the latter’s general average is higher. On the
other hand, the percentage of overlap in rate of reading was much 
smaller. These percentages on the three tests were nothing, twenty- 
— and thirty-two per cent, respectively, which means that a few 
poor readers read as fast as the slowest superior readers.
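The overlap percentages described above can be computed as in the following sketch. It counts the poor readers whose scores fall within the range shown by the superior readers; the function name and the scores are hypothetical, since the individual scores are not given in the study.

```python
def percent_within_range(poor_scores, good_scores):
    """Per cent of poor readers whose score lies inside the good readers' range."""
    low, high = min(good_scores), max(good_scores)
    inside = sum(1 for score in poor_scores if low <= score <= high)
    return 100.0 * inside / len(poor_scores)


# Hypothetical comprehension accuracy scores, for illustration only.
good = [85, 90, 95, 100]
poor = [60, 70, 86, 92, 99]

print(percent_within_range(poor, good))  # 60.0
```

The same function serves for rate measured as time, since only the range of the superior group enters into it.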

These results indicate that people who score highly on typical
time-limit reading tests excel both in speed and accuracy but that
the former is the greater determinant of their test scores. 

A qualitative analysis of the comprehension of these good and poor 
readers indicates some further points of interest. It is stated¹ that
a given score on the Van Waggenen tests indicates the ability to 
answer fifty per cent of the questions at that level of question difficulty, 
while a score ten points higher indicates the ability to answer seventy- 
five per cent of the questions at the former level. Since the good 
readers were approximately 11 points better than the poor readers 
on both tests, the conclusion may be drawn that good readers should 
answer a little over twenty-five per cent more of the questions than 
poor readers at this level of difficulty. On the Iowa Silent Reading 
test the superior readers were twenty-one per cent more accurate than 
the poor readers. However, the good readers were not uniformly 





1 Class Record Sheet for Van Waggenen Scales. Bloomington, Public School 
Publishing Company. 
superior to the poor readers throughout the Van Waggenen tests.
Since these power tests are divided into three sections, each part 
being an average of ten scale points (twenty-five per cent) greater 
difficulty than the section before it, they make a good measure of 
differences in depth or difficulty of comprehension. The good readers 
excelled the poor readers on all three sections, but increasingly so on 
the more difficult sections. (The critical ratios for the differences 
between the two groups on each part of the scales were as follows: 
Scale A Part I, 2.38; II, 3.65; III, 6.86; Scale Alpha Part I, 3.35; 
II, 4.38; III, 5.97.) Thus good readers have a greater depth of 
comprehension than poor readers, but it is doubtful if most reading 
tests fully measure these more difficult levels. 

These results should not be interpreted as indicating that speed
of reading or verbal memory are more desirable in the life’s work
than are accuracy or depth of comprehension, but, to the contrary,
they represent criticisms of our present reading tests. However, an 
analysis of the factors determining school success does indicate that 
both comprehension accuracy and reading rate are important. 

The conclusions of this study are: First, although good readers 
are superior to poor readers in both rate of reading and accuracy of 
comprehension, efficiency in the former is the greater determinant 
of their reading test superiority. Second, good readers show a greater 
degree of superiority to poor readers on questions covering more 
difficult comprehension levels than on questions concerning verbal 
memory. It is to be noted, however, that reading tests probably 
do not tap depth of comprehension very much. These conclusions 
represent criticisms of present reading tests and not a description of 
the most valuable type of reading. 
BOOK REVIEWS


PHILLIP JUSTIN RULON. The Sound Motion Picture in Science Teaching.
Cambridge: Harvard University Press, 1933, pp. IX + 236.


The author summarizes the important studies dealing with the 
use of the film in instruction and concludes that the pedagogical 
effectiveness of the moving picture lies in its ability to stimulate and 
clarify certain types of experiences. Furthermore, he points out 
that the film presented with running commentary consistently shows 
itself to be more effective than the same film without it. However, 
as the author has revealed in his analysis, the reports on the use of 
the film fail to give convincing evidence that it has been satisfactorily 
evaluated as an instructional device. 

This study is concerned with the evaluation of the educational 
effectiveness of the sound picture in the teaching of certain units 
taken from a course in general science. 

The results are based on a carefully controlled experiment with 
2860 ninth-grade pupils of three different school systems in Massachusetts.
The training extended over thirty days, or six school weeks,
at the end of which immediate recall
tests were administered to check on both rote and relational learning.
Three and one-half months after the period of training, retention 
tests were administered. 

The results of the immediate recall tests indicated that the teaching 
technique employing the sound moving picture was 20.5 per cent more 
effective from the instructional standpoint than was the unaided 
presentation. 

In terms of retention, the results of the experiment showed a 
38.5 per cent greater retained gain for the sound Film Group than for the 
Control Group. Interpreted in terms of the standard error, the film 
technique was twenty per cent (minimal index) superior to the control 
technique. 

Interpreted in terms of school costs the introduction of the sound 
film, according to the author, would result in an informational gain 
during five weeks of use at least equal to that obtainable in six weeks 
without its adoption, if conditions closely approximating those of the 
experiment were adhered to. 

In the language of the author, “The one conclusion which does
seem inevitable from the results for the two types of subject-matter 
is that whenever differences occur either in the need for illustration, 
the appropriateness of the motion-picture technique, or the extent
to which the particular film meets the need for which it is designed, 
large differences may be expected in the efficacy of the film. The 
educational motion picture does not derive its effectiveness from being 
called educational.” 

In this study, Mr. Rulon has made a significant contribution to the 
field of general educational practice. The technique seems adequate 
and should serve as a prototype for those interested in research of the 
character employed in this investigation. The manner in which the 
test elements were determined should be stimulating to research 
workers in the various subject-matter fields.

ROBERT G. SIMPSON.

Carnegie Institute of Technology. 


WILLIAM S. GRAY. Improving Instruction in Reading. Supplementary
Educational Monographs, No. 40. Chicago: University
of Chicago, 1933, pp. XII + 226.


Extensive investigations in the psychology of reading have yielded 
a comprehensive understanding of the reading process. Application 
of these findings to the teaching of reading has not been easy or direct. 
Studies especially designed, therefore, to discover adequate methods 
of improving reading in the classroom situation are needed. The 
large number of such investigations appearing since 1915 reveals the 
emphasis now placed upon this type of approach. 

This monograph reports another important contribution on 
methods of improving instruction in reading. The study was con- 
cerned with elementary school pupils in rural, village and city schools. 
Prior to the beginning of the experiment, reading instruction in these 
schools varied from a formal limited type of teaching which was 
relatively ineffective, to a highly efficient program which extended 
activities of the reading period to all school subjects. The reor- 
ganized program had most effect where poor methods had been 
employed previously, and the greatest improvement usually came at 
the end of two or three years’ use of the reorganized procedure rather
than at the end of one year. An important gain achieved was the 
enrichment of reading experience through the independent reading of 
an increased number of books. 

Three distinct contributions are made in the study: (1) Effective 
ways and means of reorganizing and improving the teaching of reading 
in harmony with the results of scientific investigations are outlined. 
(2) The character of the chief difficulties met in efforts to reorganize
and improve reading was determined and methods by which many
of the difficulties were eliminated are given. (3) The effect of these
constructive changes in teaching on reading achievement was shown
to be considerable.

Interpretation of group differences and the amount of gains discovered
would have been greatly facilitated if the author had included
(1) the number of cases used in each group, (2) a measure of variability
with each average or median, and (3) the ratio of the difference to
the standard error of the difference in the more important comparisons.
Inability to control conditions effectively in some instances
was unfortunate.

This investigation is one of the best of its kind. The carefulness 
and skill with which the study was organized and executed may well
serve as an example to other workers in the field. 


MILES A. TINKER.





University of Minnesota. 